Accelerator or accelerated functions as a service using networked processing units

ABSTRACT

Various approaches for deploying and controlling distributed accelerated compute operations with the use of infrastructure processing units (IPUs) and similar networked processing units are disclosed. A system for orchestrating acceleration functions in a network compute mesh is configured to access a flowgraph, the flowgraph including data producer-consumer relationships between a plurality of tasks in a workload; identify available artifacts and resources to execute the artifacts to complete each of the plurality of tasks, wherein an artifact is an instance of a function to perform a task of the plurality of tasks; determine a configuration assigning artifacts and resources to each of the plurality of tasks in the flowgraph; and schedule, based on the configuration, the plurality of tasks to execute using the assigned artifacts and resources.

PRIORITY CLAIM

This application claims the benefit of priority to U.S. Provisional Patent Application No. 63/425,857, filed Nov. 16, 2022, and titled “COORDINATION OF DISTRIBUTED NETWORKED PROCESSING UNITS”, which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

Embodiments described herein generally relate to data processing, network communication, and communication system implementations of distributed computing, including the implementations with the use of networked processing units (or network-addressable processing units) such as infrastructure processing units (IPUs) or data processing units (DPUs).

BACKGROUND

System architectures are moving to highly distributed multi-edge and multi-tenant deployments. Deployments may have different limitations in terms of power and space. Deployments also may use different types of compute, acceleration, and storage technologies in order to overcome these power and space limitations. Deployments also are typically interconnected in tiered and/or peer-to-peer fashion, in an attempt to create a network of connected devices and edge appliances that work together.

Edge computing, at a general level, has been described as systems that provide the transition of compute and storage resources closer to endpoint devices at the edge of a network (e.g., consumer computing devices, user equipment, etc.). As compute and storage resources are moved closer to endpoint devices, a variety of advantages have been promised, such as reduced application latency, improved service capabilities, improved compliance with security or data privacy requirements, improved backhaul bandwidth, improved energy consumption, and reduced cost. However, many deployments of edge computing technologies—especially complex deployments for use by multiple tenants—have not been fully adopted.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. Like numerals having different letter suffixes may represent different instances of similar components. Some embodiments are illustrated by way of example, and not limitation, in the figures of the accompanying drawings in which:

FIG. 1 illustrates an overview of a distributed edge computing environment, according to an example;

FIG. 2 depicts computing hardware provided among respective deployment tiers in a distributed edge computing environment, according to an example;

FIG. 3 depicts additional characteristics of respective deployment tiers in a distributed edge computing environment, according to an example;

FIG. 4 depicts a computing system architecture including a compute platform and a network processing platform provided by an infrastructure processing unit, according to an example;

FIG. 5 depicts an infrastructure processing unit arrangement operating as a distributed network processing platform within network and data center edge settings, according to an example;

FIG. 6 depicts functional components of an infrastructure processing unit and related services, according to an example;

FIG. 7 depicts a block diagram of example components in an edge computing system which implements a distributed network processing platform, according to an example;

FIG. 8 is a block diagram illustrating the general flow for providing acceleration as a service, according to an example;

FIG. 9 depicts the decomposition or refactoring of a compound service into its component microservices, functions, etc., according to an example;

FIG. 10 depicts a number of “events” or triggers that either cause the various tasks to be triggered, resumed from a waiting state, or interrupted in order to respond to some event, according to an example;

FIG. 11 depicts various data that is produced, consumed, or produced and consumed by different tasks, according to an example;

FIG. 12 depicts the execution of a task that may generate triggers that affect other tasks, according to an example;

FIG. 13 depicts a process of flow optimization, according to an example;

FIG. 14 depicts a subset of the graph illustrated in FIG. 12, in which fewer tasks and edges are shown, according to an example;

FIG. 15 depicts a table with a correspondence between the logical identifier of a task and the corresponding logical identifiers of available accelerator implementations for that task, according to an example;

FIG. 16 depicts an undirected dataflow graph of the tasks, according to an example;

FIG. 17 depicts the transformation from an unoptimized dataflow graph to an optimized version of acceleration as a service as implemented by the agency of IPUs, for the subset graph shown in FIG. 14, according to an example;

FIG. 18 depicts a database of information that shows, for each type of logical artifact, various instances (i.e., execution-capable resources) that can support its execution, according to an example;

FIG. 19 depicts various functional components of an IPU, according to an example; and

FIG. 20 depicts a flowchart of a method for orchestrating acceleration functions in a network compute mesh, according to an example.

DETAILED DESCRIPTION

Various approaches for providing accelerators and accelerated functions in an edge computing setting are discussed herein. Existing approaches rely on a centralized model where functions are executed at a central store, typically a datacenter, by clients that connect to the central store. In an edge-to-cloud compute continuum, there is a need to weave acceleration into computations at scale. The rise of artificial intelligence (AI) and the need for real-time, machine-learning-guided, large-scale distributed operations in all walks of life means that processing devices will increasingly spend significant time collaborating not just with their remote peers but also with remote accelerators. Infrastructure processing units (IPUs) may offload some of this work, but a large unaddressed need remains: to aggregate acceleration capabilities that are accessible at low latency within a host (a VM, a process, a container, etc.) and provide a seamless acceleration-as-a-service (XaaS) capability to high-level business/client software layers. The architecture, which includes an acceleration-as-a-service layer, provides a mechanism to optimize the data flows between different CPU/XPU elements that run different parts of a composite workload.

Various approaches and mechanisms are described herein to implement and enable acceleration pooling, intelligent scheduling and orchestration of dependent tasks, and providing acceleration-as-a-service (XaaS) or accelerated-function-as-a-service (XFaaS).

In various examples, the logic that is used to configure the mechanisms for acceleration pooling and providing XaaS or XFaaS is managed by a network switch or other network-addressable component. For instance, a network switch can monitor or orchestrate the flow of data between CPU-based and non-CPU-based (e.g., hardware accelerator) functions or microservices among network-addressable compute nodes in a network. Non-CPU-based hardware may include circuitry and devices such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), coarse-grained reconfigurable arrays (CGRAs), systems-on-chip (SOCs), graphics processing units (GPUs), and the like.

Accordingly, the following describes coordinated, intelligent components to configure a combination of memory and compute resources for servicing client workloads and increasing processing speed. While many of the techniques may be implemented by a switch, orchestrator, or controller, the techniques are also suited for use by networked processing units such as infrastructure processing units (IPUs, such as respective IPUs operating as a memory owner and a remote memory consumer).

Additional implementation details of providing acceleration or accelerated functions as a service in an edge computing network, implemented by way of a network switch or IPUs, are provided in FIGS. 8 to 20, below. General implementation details of an edge computing network and the use of distributed networked processing units in such a network are provided in FIGS. 1 to 7, below.

Distributed Edge Computing and Networked Processing Units

FIG. 1 is a block diagram 100 showing an overview of a distributed edge computing environment, which may be adapted for implementing the present techniques for distributed networked processing units. As shown, the edge cloud 110 is established from processing operations among one or more edge locations, such as a satellite vehicle 141, a base station 142, a network access point 143, an on-premise server 144, a network gateway 145, or similar networked devices and equipment instances. These processing operations may be coordinated by one or more edge computing platforms 120 or systems that operate networked processing units (e.g., IPUs, DPUs) as discussed herein.

The edge cloud 110 is generally defined as involving compute that is located closer to endpoints 160 (e.g., consumer and producer data sources) than the cloud 130, such as autonomous vehicles 161, user equipment 162, business and industrial equipment 163, video capture devices 164, drones 165, smart cities and building devices 166, sensors and IoT devices 167, etc. Compute, memory, network, and storage resources that are offered at the entities in the edge cloud 110 can provide ultra-low or improved latency response times for services and functions used by the endpoint data sources, as well as reduce network backhaul traffic from the edge cloud 110 toward the cloud 130, thus improving energy consumption and overall network usage, among other benefits.

Compute, memory, and storage are scarce resources, and generally decrease depending on the edge location (e.g., fewer processing resources being available at consumer endpoint devices than at a base station or a central office data center). As a general design principle, edge computing attempts to minimize the number of resources needed for network services, through the distribution of more resources that are located closer both geographically and in terms of in-network access time.

FIG. 2 depicts examples of computing hardware provided among respective deployment tiers in a distributed edge computing environment. Here, one tier at an on-premise edge system is an intelligent sensor or gateway tier 210, which operates network devices with low power and entry-level processors and low-power accelerators. Another tier at an on-premise edge system is an intelligent edge tier 220, which operates edge nodes with higher power limitations and may include high-performance storage.

Further in the network, a network edge tier 230 operates servers including form factors optimized for extreme conditions (e.g., outdoors). A data center edge tier 240 operates additional types of edge nodes such as servers, and includes increasingly powerful or capable hardware and storage technologies. Still further in the network, a core data center tier 250 and a public cloud tier 260 operate compute equipment with the highest power consumption and largest configuration of processors, acceleration, storage/memory devices, and highest-throughput network.

In each of these tiers, various forms of Intel® processor lines are depicted for purposes of illustration; it will be understood that other brands and manufacturers of hardware will be used in real-world deployments. Additionally, it will be understood that additional features or functions may exist among multiple tiers. One such example is connectivity and infrastructure management that enables a distributed IPU architecture that can potentially extend across all of tiers 210, 220, 230, 240, 250, 260. Other relevant functions that may extend across multiple tiers may relate to security features, domain or group functions, and the like.

FIG. 3 depicts additional characteristics of respective deployment tiers in a distributed edge computing environment, based on the tiers discussed with reference to FIG. 2. This figure depicts additional network latencies at each of the tiers 210, 220, 230, 240, 250, 260, and the gradual increase in latency in the network as the compute is located at a longer distance from the edge endpoints. Additionally, this figure depicts additional power and form factor constraints, use cases, and key performance indicators (KPIs).

With these variations and service features in mind, edge computing within the edge cloud 110 may provide the ability to serve and respond to multiple applications of the use cases in real time or near real time and meet ultra-low latency requirements. As systems have become highly distributed, networking has become one of the fundamental pieces of the architecture that allows achieving scale with resiliency, security, and reliability. Networking technologies have evolved to provide more capabilities beyond pure network routing capabilities, including to coordinate quality of service, security, multi-tenancy, and the like. This has also been accelerated by the development of new smart network adapter cards and other types of network derivatives that incorporate capabilities such as ASICs (application-specific integrated circuits) or FPGAs (field-programmable gate arrays) to accelerate some of those functionalities (e.g., remote attestation).

In these contexts, networked processing units have begun to be deployed at network cards (e.g., smart NICs), gateways, and the like, which allow direct processing of network workloads and operations. One example of a networked processing unit is an infrastructure processing unit (IPU), which is a programmable network device that can be extended to provide compute capabilities with far richer functionalities beyond pure networking functions. Another example of a network processing unit is a data processing unit (DPU), which offers programmable hardware for performing infrastructure and network processing operations. The following discussion refers to functionality applicable to an IPU configuration, such as that provided by an Intel® line of IPU processors. However, it will be understood that functionality will be equally applicable to DPUs and other types of networked processing units provided by ARM®, Nvidia®, and other hardware OEMs.

FIG. 4 depicts an example compute system architecture that includes a compute platform 420 and a network processing platform comprising an IPU 410. This architecture—and in particular the IPU 410—can be managed, coordinated, and orchestrated by the functionality discussed below, including with the functions described with reference to FIG. 6.

The main compute platform 420 is composed of typical elements that are included with a computing node, such as one or more CPUs 424 that may or may not be connected via a coherent domain (e.g., via Ultra Path Interconnect (UPI) or another processor interconnect); one or more memory units 425; one or more additional discrete devices 426 such as storage devices, discrete acceleration cards (e.g., a field-programmable gate array (FPGA), a visual processing unit (VPU), etc.); a baseboard management controller 421; and the like. The compute platform 420 may operate one or more containers 422 (e.g., with one or more microservices), within a container runtime 423 (e.g., Docker containers). The IPU 410 operates as a networking interface and is connected to the compute platform 420 using an interconnect (e.g., using either PCIe or CXL). The IPU 410, in this context, can be observed as another small compute device that has its own: (1) processing cores (e.g., provided by low-power cores 417); (2) operating system (OS) and cloud-native platform 414 to operate one or more containers 415 and a container runtime 416; (3) acceleration functions provided by an ASIC 411 or FPGA 412; (4) memory 418; (5) network functions provided by network circuitry 413; etc.

From a system design perspective, this arrangement provides important functionality. The IPU 410 is seen as a discrete device from the local host (e.g., the OS running in the compute platform CPUs 424) that is available to provide certain functionalities (networking, acceleration, etc.). Those functionalities are typically provided via physical or virtual PCIe functions. Additionally, the IPU 410 is seen as a host (with its own IP, etc.) that can be accessed by the infrastructure to set up an OS, run services, and the like. The IPU 410 sees all the traffic going to the compute platform 420 and can perform actions—such as intercepting the data or performing some transformation—as long as the correct security credentials are hosted to decrypt the traffic. Traffic going through the IPU goes through all the layers of the Open Systems Interconnection model (OSI model) stack (e.g., from the physical to the application layer). Depending on the features that the IPU has, processing may be performed at the transport layer only. However, if the IPU has capabilities to perform traffic intercept, then the IPU also may be able to intercept traffic at the traffic layer (e.g., intercept CDN traffic and process it locally).

Some of the use cases being proposed for IPUs and similar networked processing units include: to accelerate network processing; to manage hosts (e.g., in a data center); or to implement quality of service policies. However, most functionalities today are focused on using the IPU at the local appliance level and within a single system. These approaches do not address how IPUs could work together in a distributed fashion or how system functionalities can be divided among the IPUs on other parts of the system. Accordingly, the following introduces enhanced approaches for enabling and controlling distributed functionality among multiple networked processing units. This enables the extension of current IPU functionalities to work as a distributed set of IPUs that can work together to achieve stronger features such as resiliency, reliability, etc.

Distributed Architectures of IPUs

FIG. 5 depicts an IPU arrangement operating as a distributed network processing platform within network and data center edge settings. In a first deployment model of a computing environment 510, workloads or processing requests are directly provided to an IPU platform, such as directly to IPU 514. In a second deployment model of the computing environment 510, workloads or processing requests are provided to some intermediate processing device 512, such as a gateway or NUC (next unit of computing) device form factor, and the intermediate processing device 512 forwards the workloads or processing requests to the IPU 514. It will be understood that a variety of other deployment models involving the composability and coordination of one or more IPUs, compute units, network devices, and other hardware may be provided.

With the first deployment model, the IPU 514 directly receives data from use cases 502A. The IPU 514 operates one or more containers with microservices to perform processing of the data. As an example, a small gateway (e.g., a NUC type of appliance) may connect multiple cameras to an edge system that is managed or connected by the IPU 514. The IPU 514 may process data as a small aggregator of sensors that runs on the far edge, or may perform some level of inline processing or preprocessing and then send the payload to be further processed by the IPU or the system to which the IPU connects.

With the second deployment model, the intermediate processing device 512 provided by the gateway or NUC receives data from use cases 502B. The intermediate processing device 512 includes various processing elements (e.g., CPU cores, GPUs), and may operate one or more microservices for servicing workloads from the use cases 502B. However, the intermediate processing device 512 invokes the IPU 514 to complete processing of the data.

In either the first or the second deployment model, the IPU 514 may connect with a local compute platform, such as that provided by a CPU 516 (e.g., Intel® Xeon CPU) operating multiple microservices. The IPU may also connect with a remote compute platform, such as that provided at a data center by CPU 540 at a remote server. As an example, consider a microservice that performs some analytical processing (e.g., face detection on image data), where the CPU 516 and the CPU 540 provide access to this same microservice. The IPU 514, depending on the current load of the CPU 516 and the CPU 540, may decide to forward the images or payload to one of the two CPUs. Data forwarding or processing can also depend on other factors such as SLA for latency or performance metrics (e.g., perf/watt) in the two systems. As a result, the distributed IPU architecture may accomplish features of load balancing.
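By way of a non-limiting illustration, the forwarding decision described above may be sketched as follows; the target names, load figures, latency estimates, and SLA threshold are hypothetical values chosen only for the example and are not part of the disclosed system.

```python
# Hypothetical sketch: an IPU chooses between a local CPU and a remote CPU
# hosting the same microservice, based on reported load and a latency SLA.
from dataclasses import dataclass

@dataclass
class Target:
    name: str
    current_load: float      # 0.0 (idle) .. 1.0 (saturated)
    est_latency_ms: float    # estimated end-to-end latency for this target
    perf_per_watt: float     # performance metric used as a fallback criterion

def select_target(targets: list[Target], latency_sla_ms: float) -> Target:
    """Pick the least-loaded target that still meets the latency SLA;
    fall back to the best perf/watt if no target meets the SLA."""
    eligible = [t for t in targets if t.est_latency_ms <= latency_sla_ms]
    if eligible:
        return min(eligible, key=lambda t: t.current_load)
    return max(targets, key=lambda t: t.perf_per_watt)

# Example: deciding where to send a face-detection payload (values illustrative).
local_cpu = Target("cpu_516_local", current_load=0.85, est_latency_ms=4.0, perf_per_watt=1.2)
remote_cpu = Target("cpu_540_remote", current_load=0.30, est_latency_ms=9.0, perf_per_watt=2.0)
print(select_target([local_cpu, remote_cpu], latency_sla_ms=10.0).name)   # cpu_540_remote
```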

The IPU in the computing environment 510 may be coordinated with other network-connected IPUs. In an example, a Service and Infrastructure orchestration manager 530 may use multiple IPUs as a mechanism to implement advanced service processing schemes for the user stacks. This may also enable implementation of system functionalities such as failover, load balancing, etc.

In a distributed architecture example, IPUs can be arranged in the following non-limiting configurations. As a first configuration, a particular IPU (e.g., IPU 514) can work with other IPUs (e.g., IPU 520) to implement failover mechanisms. For example, an IPU can be configured to forward traffic to service replicas that run on other systems when a local host does not respond.

As a second configuration, a particular IPU (e.g., IPU 514) can work with other IPUs (e.g., IPU 520) to perform load balancing across other systems. For example, consider a scenario where CDN traffic targeted to the local host is forwarded to another host in case that I/O or compute in the local host is scarce at a given moment.

As a third configuration, a particular IPU (e.g., IPU 514) can work as a power management entity to implement advanced system policies. For example, consider a scenario where the whole system (e.g., including CPU 516) is placed in a C6 state (a low-power/power-down state available to a processor) while forwarding traffic to other systems (e.g., IPU 520) and consolidating it.

As will be understood, fully coordinating a distributed IPU architecture requires numerous aspects of coordination and orchestration. The following examples of system architecture deployments provide discussion of how edge computing systems may be adapted to include coordinated IPUs, and how such deployments can be orchestrated to use IPUs at multiple locations to expand to the new envisioned functionality.

Distributed IPU Functionality

An arrangement of distributed IPUs offers a set of new functionalities to enable IPUs to be service focused. FIG. 6 depicts functional components of an IPU 610, including services and features to implement the distributed functionality discussed herein. It will be understood that some or all of the functional components provided in FIG. 6 may be distributed among multiple IPUs, hardware components, or platforms, depending on the particular configuration and use case involved.

In the block diagram of FIG. 6, a number of functional components are operated to manage requests for a service running in the IPU (or running in the local host). As discussed above, IPUs can either run services or intercept requests arriving at services running in the local host and perform some action. In the latter case, the IPU can perform the following types of actions/functions (provided as non-limiting examples).

Peer Discovery. In an example, each IPU is provided with Peer Discovery logic to discover other IPUs in the distributed system that can work together with it. Peer Discovery logic may use mechanisms such as broadcasting to discover other IPUs that are available on a network. The Peer Discovery logic is also responsible for working with the Peer Attestation and Authentication logic to validate and authenticate a peer IPU's identity, determine whether it is trustworthy, and determine whether the current system tenant allows the current IPU to work with it. To accomplish this, an IPU may perform operations such as: retrieve a proof of identity and proof of attestation; connect to a trusted service running in a trusted server; or validate that the discovered system is trustworthy. Various technologies (including hardware components or standardized software implementations) that enable attestation, authentication, and security may be used with such operations.
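A minimal sketch of such peer admission logic is shown below, assuming a discovery broadcast that returns peer descriptors and an external attestation service exposing a verify operation; all names and fields are illustrative placeholders rather than the disclosed implementation.

```python
# Illustrative sketch of gating discovered IPUs on attestation and tenant policy.
def discover_and_admit_peers(broadcast_responses, tenant_policy, attestation_service):
    admitted = []
    for peer in broadcast_responses:                  # peers answering a discovery broadcast
        proof = peer.get("proof_of_identity")
        if proof is None:
            continue                                   # no identity offered; ignore the peer
        if not attestation_service.verify(proof):      # validate via a trusted attestation service
            continue                                   # peer not trustworthy; do not admit
        if peer["tenant"] not in tenant_policy["allowed_tenants"]:
            continue                                   # current tenant disallows working with this peer
        admitted.append(peer["address"])
    return admitted
```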

Peer Attestation. In an example, each IPU provides interfaces to other IPUs to enable attestation of the IPU itself. IPU Attestation logic is used to perform an attestation flow within a local IPU in order to create the proof of identity that will be shared with other IPUs. Attestation here may integrate previous approaches and technologies to attest a compute platform. This may also involve the use of a trusted attestation service 640 to perform the attestation operations.

Functionality Discovery. In an example, a particular IPU includes capabilities to discover the functionalities that peer IPUs provide. Once the authentication is done, the IPU can determine what functionalities the peer IPUs provide (using the IPU Peer Discovery logic) and store a record of such functionality locally. Examples of properties to discover can include: (i) the type of IPU and functionalities provided and associated KPIs (e.g., performance/watt, cost, etc.); (ii) available functionalities as well as possible functionalities to execute under secure enclaves (e.g., enclaves provided by Intel® SGX or TDX technologies); (iii) current services that are running on the IPU and on the system that can potentially accept requests forwarded from this IPU; or (iv) other interfaces or hooks that are provided by an IPU, such as: access to remote storage; access to a remote VPU; access to certain functions. In a specific example, a service may be described by properties such as: UUID; estimated performance KPIs in the host or IPU; average performance provided by the system during the N units of time (or any other type of indicator); and like properties.
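One possible, assumed data model for the locally stored record of discovered peer functionality, mirroring properties (i)-(iv) and the example service properties above, is sketched here; the field names are illustrative only.

```python
# Assumed data model for recording a peer IPU's discovered functionality.
from dataclasses import dataclass, field

@dataclass
class ServiceRecord:
    uuid: str
    estimated_kpis: dict             # e.g., {"perf_per_watt": 2.0, "cost": 0.4}
    avg_performance_last_n: float    # average performance over the last N time units

@dataclass
class PeerFunctionalityRecord:
    ipu_type: str
    kpis: dict                                                    # e.g., performance/watt, cost
    functionalities: list = field(default_factory=list)
    enclave_functionalities: list = field(default_factory=list)   # executable under secure enclaves
    running_services: list = field(default_factory=list)          # ServiceRecord entries
    extra_hooks: list = field(default_factory=list)               # e.g., remote storage, remote VPU
```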

Service Management. The IPU includes functionality to manage services that are running either on the host compute platform or in the IPU itself. Managing (orchestrating) services includes performing service and resource orchestration for the services that can run on the IPU or that the IPU can affect. Two types of usage models are envisioned:

External Orchestration Coordination. The IPU may enable external orchestrators to deploy services on the IPU compute capabilities. To do so, an IPU includes a component similar to Kubernetes (K8s)-compatible APIs to manage the containers (services) that run on the IPU itself. For example, the IPU may run a service that is just providing content to storage connected to the platform. In this case, the orchestration entity running in the IPU may manage the services running in the IPU as it happens in other systems (e.g., keeping the service level objectives).

Further, external orchestrators can be allowed to register with the IPU that services running on the host may require request brokering, failover mechanisms, and other functionalities. For example, an external orchestrator may register that a particular service running on the local compute platform is replicated in another edge node managed by another IPU where requests can be forwarded.

In this latter use case, external orchestrators may provide to the Service/Application Intercept logic the inputs that are needed to intercept traffic for these services (as such traffic typically is encrypted). This may include properties such as the source and destination of the traffic to be intercepted, or the key to use to decrypt the traffic. Likewise, this may be needed to terminate TLS to understand the requests that arrive at the IPU and that the other logics may need to parse to take actions. For example, if there is a CDN read request, the IPU may need to decrypt the packet to understand that the network packet includes a read request, and may redirect it to another host based on the content that is being intercepted. Examples of Service/Application Intercept information are depicted in table 620 in FIG. 6.

External Orchestration Implementation. External orchestration can be implemented in multiple topologies. One supported topology includes having the orchestrator managing all the IPUs running on the backend public or private cloud. Another supported topology includes having the orchestrator managing all the IPUs running in a centralized edge appliance. Still another supported topology includes having the orchestrator running in another IPU that is working as the controller, or having the orchestrator running distributed in multiple other IPUs that are working as controllers (master/primary node), or in a hierarchical arrangement.

Functionality for Brokering requests. The IPU may include Service Request Brokering logic and Load Balancing logic to perform brokering actions on arrival of requests for target services running in the local system. For instance, the IPU may decide to see if those requests can be executed by other peer systems (e.g., accessible through Service and Infrastructure Orchestration 630). This can be caused, for example, because load in the local system is high. The local IPU may negotiate with other peer IPUs for the possibility to forward the request. Negotiation may involve metrics such as cost. Based on such negotiation metrics, the IPU may decide to forward the request.

Functionality for Load Balancing requests. The Service Request Brokering and Load Balancing logic may distribute requests arriving at the local IPU to other peer IPUs. In this case, the other IPUs and the local IPU work together and do not necessarily need brokering. Such logic acts similarly to a cloud-native sidecar proxy. For instance, requests arriving at the system may be sent to the service X running in the local system (either IPU or compute platform) or forwarded to a peer IPU that has another instance of service X running. The load balancing distribution can be based on existing algorithms, such as selecting the system with the lower load, using round robin, etc.
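The load balancing distribution may, for example, be sketched as follows, with a round-robin policy and a least-loaded policy over local and peer instances of service X; the instance names and load values are hypothetical.

```python
# Non-authoritative sketch of distributing requests for service X across
# the local system and peer IPUs hosting other instances of the service.
import itertools

class ServiceXBalancer:
    def __init__(self, instances):
        self.instances = instances                 # local + peer instances of service X
        self._rr = itertools.cycle(instances)

    def pick_round_robin(self):
        return next(self._rr)

    def pick_least_loaded(self, load_by_instance):
        return min(self.instances, key=lambda i: load_by_instance[i])

balancer = ServiceXBalancer(["local_host", "peer_ipu_520"])
print(balancer.pick_round_robin())                                                # local_host
print(balancer.pick_least_loaded({"local_host": 0.9, "peer_ipu_520": 0.2}))       # peer_ipu_520
```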

Functionality for failover, resiliency, and reliability. The IPU includes Reliability and Failover logic to monitor the status of the services running on the compute platform or the status of the compute platform itself. The Reliability and Failover logic may require the Load Balancing logic to transiently or permanently forward requests that target specific services in situations such as where: i) the compute platform is not responding; ii) the service running inside the compute node is not responding; and iii) the compute platform load prevents the targeted service from providing the right level of service level objectives (SLOs). Note that the logic must know the required SLOs for the services. Such functionality may be coordinated with service information 650 including SLO information.
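A simplified, non-limiting sketch of the failover check over conditions i)-iii) above may take the following form, where the responsiveness flags and latency figures are assumed inputs supplied by monitoring.

```python
# Illustrative failover decision based on the three conditions listed above.
def should_fail_over(platform_responsive, service_responsive, current_latency_ms, slo_latency_ms):
    if not platform_responsive:          # condition i: compute platform not responding
        return True
    if not service_responsive:           # condition ii: service inside the compute node not responding
        return True
    return current_latency_ms > slo_latency_ms   # condition iii: load prevents meeting the known SLO

print(should_fail_over(True, True, current_latency_ms=42.0, slo_latency_ms=25.0))  # True
```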

Functionality for executing parts of the workloads. Use cases such as video analytics tend to be decomposed into different microservices that form a pipeline of actions that can be used together. The IPU may include workload pipeline execution logic that understands how workloads are composed and manages their execution. Workloads can be defined as a graph that connects different microservices. The load balancing and brokering logic may be able to understand those graphs and decide what parts of the pipeline are executed where. Further, to perform these and other operations, the Intercept logic will also decode what operations are included as part of the requests.
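The pipeline execution logic may be approximated by the following sketch, in which the workload graph, placement decision, and dispatch mechanism are supplied as assumed callables rather than the specific logic of the IPU.

```python
# Sketch of executing a workload defined as a graph of microservices,
# deciding where each stage of the pipeline runs before dispatching it.
def run_pipeline(stages, edges, placement_fn, dispatch_fn):
    """stages: microservice names listed in dependency order;
    edges: dict mapping a producing stage to the stages consuming its output;
    placement_fn: returns an execution target (local IPU, host, peer) for a stage;
    dispatch_fn: invokes the stage on the chosen target with its inputs."""
    outputs = {}
    for stage in stages:
        target = placement_fn(stage)          # decide where this part of the pipeline executes
        inputs = [outputs[p] for p, consumers in edges.items() if stage in consumers]
        outputs[stage] = dispatch_fn(stage, target, inputs)
    return outputs
```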

Resource Management

A distributed network processing configuration may enable IPUs to perform an important role for managing resources of edge appliances. As further shown in FIG. 6, the functional components of an IPU can operate to perform these and similar types of resource management functionalities.

As a first example, an IPU can provide management or access to external resources that are hosted in other locations and expose them as local resources using constructs such as Compute Express Link (CXL). For example, the IPU could potentially provide access to a remote accelerator that is hosted in a remote system via CXL.mem/cache and IO. Another example includes providing access to a remote storage device hosted in another system. In this latter case, the local IPU could work with another IPU in the storage system and expose the remote system as PCIe VF/PF (virtual functions/physical functions) to the local host.

As a second example, an IPU can provide access to IPU-specific resources. Those IPU resources may be physical (such as storage or memory) or virtual (such as a service that provides access to random number generation).

As a third example, an IPU can manage local resources that are hosted in the system where it belongs. For example, the IPU can manage power of the local compute platform.

As a fourth example, an IPU can provide access to other types of elements that relate to resources (such as telemetry or other types of data). In particular, telemetry provides useful data needed to decide where to execute workloads or to identify problems.

I/O Management. Because the IPU is acting as a connection proxy between the external peer resources (compute systems, remote storage, etc.) and the local compute, the IPU can also include functionality to manage I/O from the system perspective.

Host Virtualization and XPU Pooling. The IPU includes Host Virtualization and XPU Pooling logic responsible for managing access to resources that are outside the system domain (or within the IPU) and that can be offered to the local compute system. Here, “XPU” refers to any type of processing unit, whether CPU, GPU, VPU, an acceleration processing unit, etc. The IPU logic, after discovery and attestation, can agree with other systems to share external resources with the services running in the local system. IPUs may advertise available resources to other peers, or these resources can be discovered during the discovery phase as introduced earlier. IPUs may request those resources from other IPUs. For example, an IPU on system A may request access to storage on system B managed by another IPU. Remote and local IPUs can work together to establish a connection between the target resources and the local system.

Once the connection and resource mapping is completed, resources can be exposed to the services running in the local compute node using the VF/PF PCIe and CXL logic. Each of those resources can be offered as VF/PF. The IPU logic can expose to the local host resources that are hosted in the IPU. Examples of resources to expose may include local accelerators, access to services, and the like.

Power Management. Power management is one of the key features to achieve favorable system operational expenditures (OPEX). The IPU is very well positioned to optimize the power consumption of the local system. The Distributed and Local Power Management Unit is responsible for metering the power that the system is consuming and the load that the system is receiving, and for tracking the service level agreements that the various services running in the system are achieving for the arriving requests. Likewise, when power efficiencies (e.g., power usage effectiveness (PUE)) are not achieving certain thresholds or the local compute demand is low, the IPU may decide to forward requests for local services to other IPUs that host replicas of the services. Such power management features may also coordinate with the Brokering and Load Balancing logic discussed above. As will be understood, IPUs can work together to decide where requests can be consolidated to establish higher power efficiency as a system. When traffic is redirected, the local power consumption can be reduced in different ways. Example operations that can be performed include: changing the system to a C6 state; changing the base frequencies; or performing other adaptations of the system or system components.
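As a hedged illustration of the consolidation decision, the following sketch forwards requests to replica-hosting IPUs and steps the local platform down when a power-efficiency threshold is missed or local demand is low; the thresholds, state names, and callables are assumptions used only for the example.

```python
# Illustrative power consolidation step: redirect requests and enter a
# low-power state when efficiency or demand criteria indicate it.
def power_management_step(pue, pue_threshold, local_demand, demand_floor,
                          replica_ipus, forward_fn, set_power_state_fn):
    # A higher PUE means worse power efficiency, so exceeding the threshold
    # (or seeing low local demand) triggers consolidation.
    if pue > pue_threshold or local_demand < demand_floor:
        for ipu in replica_ipus:
            forward_fn(ipu)              # redirect arriving requests to IPUs hosting service replicas
        set_power_state_fn("C6")         # e.g., place the local platform in a low-power state
        return "consolidated"
    return "local"
```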

Telemetry Metrics. The IPU can generate multiple types of metrics that can be of interest to services, orchestration, or tenants owning the system. In various examples, telemetry can be accessed, including: (i) out of band via side interfaces; (ii) in band by services running in the IPU; or (iii) out of band using PCIe or CXL from the host perspective. Relevant types of telemetry can include: platform telemetry; service telemetry; IPU telemetry; traffic telemetry; and the like.

System Configurations for Distributed Processing

Further to the examples noted above, the following configurations may be used for processing with distributed IPUs:

1) Local IPUs connected to a compute platform by an interconnect (e.g., as shown in the configuration of FIG. 4);

2) Shared IPUs hosted within a rack/physical network—such as in a virtual slice or multi-tenant implementation of IPUs connected via CXL/PCI-E (local), or extension via Ethernet/Fiber for nodes within a cluster;

3) Remote IPUs accessed via an IP network, such as within certain latency for data plane offload/storage offloads (or, connected for management/control plane operations); or

4) Distributed IPUs providing an interconnected network of IPUs, including as many as hundreds of nodes within a domain.

Configurations of distributed IPUs working together may also include fragmented distributed IPUs, where each IPU or pooled system provides part of the functionalities, and each IPU becomes a malleable system. Configurations of distributed IPUs may also include virtualized IPUs, such as provided by a gateway, switch, or an inline component (e.g., inline between the service acting as IPU), and in some examples, in scenarios where the system has no IPU.

Other deployment models for IPUs may include IPU-to-IPU in the same tier or a close tier; IPU-to-IPU in the cloud (data to compute versus compute to data); integration in small device form factors (e.g., gateway IPUs); gateway/NUC+IPU which connects to a data center; multiple GW/NUC (e.g., 16) which connect to one IPU (e.g., switch); gateway/NUC+IPU on the server; and GW/NUC and IPU that are connected to a server with an IPU.

The preceding distributed IPU functionality may be implemented among a variety of types of computing architectures, including one or more gateway nodes, one or more aggregation nodes, or edge or core data centers distributed across layers of the network (e.g., in the arrangements depicted in FIGS. 2 and 3). Accordingly, such IPU arrangements may be implemented in an edge computing system by or on behalf of a telecommunication service provider (“telco”, or “TSP”), internet-of-things service provider, cloud service provider (CSP), enterprise entity, or any other number of entities. Various implementations and configurations of the edge computing system may be provided dynamically, such as when orchestrated to meet service objectives. Such edge computing systems may be embodied as a type of device, appliance, computer, or other “thing” capable of communicating with other edge, networking, or endpoint components.

FIG. 7 depicts a block diagram of example components in a computing device 750 which can operate as a distributed network processing platform. The computing device 750 may include any combinations of the components referenced above, implemented as integrated circuits (ICs), as a package or system-on-chip (SoC), or as portions thereof, discrete electronic devices, or other modules, logic, instruction sets, programmable logic or algorithms, hardware, hardware accelerators, software, firmware, or a combination thereof adapted in the computing device 750, or as components otherwise incorporated within a larger system. Specifically, the computing device 750 may include processing circuitry comprising one or both of a network processing unit 752 (e.g., an IPU or DPU, as discussed above) and a compute processing unit 754 (e.g., a CPU).

The network processing unit 752 may provide a networked specialized processing unit such as an IPU, DPU, network processing unit (NPU), or other “xPU” outside of the central processing unit (CPU). The processing unit may be embodied as a standalone circuit or circuit package, integrated within an SoC, integrated with networking circuitry (e.g., in a SmartNIC), or integrated with acceleration circuitry, storage devices, or AI or specialized hardware, consistent with the examples above.

The compute processing unit 754 may provide a processor as a central processing unit (CPU) microprocessor, multi-core processor, multithreaded processor, an ultra-low voltage processor, an embedded processor, or other forms of a special purpose processing unit or specialized processing unit for compute operations.

Either the network processing unit 752 or the compute processing unit 754 may be a part of a system on a chip (SoC) which includes components formed into a single integrated circuit or a single package. The network processing unit 752 or the compute processing unit 754 and accompanying circuitry may be provided in a single socket form factor, multiple socket form factor, or a variety of other formats.

The processing units 752, 754 may communicate with a system memory 756 (e.g., random access memory (RAM)) over an interconnect 755 (e.g., a bus). In an example, the system memory 756 may be embodied as volatile (e.g., dynamic random access memory (DRAM), etc.) memory. Any number of memory devices may be used to provide for a given amount of system memory. A storage 758 may also couple to the processor 752 via the interconnect 755 to provide for persistent storage of information such as data, applications, operating systems, and so forth. In an example, the storage 758 may be implemented as non-volatile storage such as a solid-state disk drive (SSD).

The components may communicate over the interconnect 755. The interconnect 755 may include any number of technologies, including industry-standard architecture (ISA), extended ISA (EISA), peripheral component interconnect (PCI), peripheral component interconnect extended (PCIx), PCI express (PCIe), Compute Express Link (CXL), or any number of other technologies. The interconnect 755 may couple the processing units 752, 754 to a transceiver 766, for communications with connected edge devices 762.

The transceiver 766 may use any number of frequencies and protocols. For example, a wireless local area network (WLAN) unit may implement Wi-Fi® communications in accordance with the Institute of Electrical and Electronics Engineers (IEEE) 802.11 standard, or a wireless wide area network (WWAN) unit may implement wireless wide area communications according to a cellular, mobile network, or other wireless wide area protocol. The wireless network transceiver 766 (or multiple transceivers) may communicate using multiple standards or radios for communications at a different range. A wireless network transceiver 766 (e.g., a radio transceiver) may be included to communicate with devices or services in the edge cloud 110 or the cloud 130 via local or wide area network protocols.

The communication circuitry (e.g., transceiver 766, network interface 768, external interface 770, etc.) may be configured to use any one or more communication technologies (e.g., wired or wireless communications) and associated protocols (e.g., a cellular networking protocol such as a 3GPP 4G or 5G standard, a wireless local area network protocol such as IEEE 802.11/Wi-Fi®, a wireless wide area network protocol, Ethernet, Bluetooth®, Bluetooth Low Energy, an IoT protocol such as IEEE 802.15.4 or ZigBee®, Matter®, low-power wide-area network (LPWAN) or low-power wide-area (LPWA) protocols, etc.) to effect such communication. Given the variety of types of applicable communications from the device to another component or network, applicable communications circuitry used by the device may include or be embodied by any one or more of components 766, 768, or 770. Accordingly, in various examples, applicable means for communicating (e.g., receiving, transmitting, etc.) may be embodied by such communications circuitry.

The computing device 750 may include or be coupled to acceleration circuitry 764, which may be embodied by one or more AI accelerators, a neural compute stick, neuromorphic hardware, an FPGA, an arrangement of GPUs, one or more SoCs, one or more CPUs, one or more digital signal processors, dedicated ASICs, or other forms of specialized processors or circuitry designed to accomplish one or more specialized tasks. These tasks may include AI processing (including machine learning, training, inferencing, and classification operations), visual data processing, network data processing, object detection, rule analysis, or the like. Accordingly, in various examples, applicable means for acceleration may be embodied by such acceleration circuitry.

The interconnect 755 may couple the processing units 752, 754 to a sensor hub or external interface 770 that is used to connect additional devices or subsystems. The devices may include sensors 772, such as accelerometers, level sensors, flow sensors, optical light sensors, camera sensors, temperature sensors, global navigation system (e.g., GPS) sensors, pressure sensors, and the like. The hub or interface 770 further may be used to connect the edge computing node 750 to actuators 774, such as power switches, valve actuators, an audible sound generator, a visual warning device, and the like.

In some optional examples, various input/output (I/O) devices may be present within, or connected to, the edge computing node 750. For example, a display or other output device 784 may be included to show information, such as sensor readings or actuator position. An input device 786, such as a touch screen or keypad, may be included to accept input. An output device 784 may include any number of forms of audio or visual display, including simple visual outputs such as LEDs or more complex outputs such as display screens (e.g., LCD screens), with the output of characters, graphics, multimedia objects, and the like being generated or produced from the operation of the edge computing node 750.

A battery 776 may power the edge computing node 750, although, in examples in which the edge computing node 750 is mounted in a fixed location, it may have a power supply coupled to an electrical grid, or the battery may be used as a backup or for temporary capabilities. A battery monitor/charger 778 may be included in the edge computing node 750 to track the state of charge (SoCh) of the battery 776. The battery monitor/charger 778 may be used to monitor other parameters of the battery 776 to provide failure predictions, such as the state of health (SoH) and the state of function (SoF) of the battery 776. A power block 780, or other power supply coupled to a grid, may be coupled with the battery monitor/charger 778 to charge the battery 776.

In an example, the instructions 782 on the processing units 752, 754 (separately, or in combination with the instructions 782 of the machine-readable medium 760) may configure execution or operation of a trusted execution environment (TEE) 790. In an example, the TEE 790 operates as a protected area accessible to the processing units 752, 754 for secure execution of instructions and secure access to data. Other aspects of security hardening, hardware roots-of-trust, and trusted or protected operations may be implemented in the edge computing node 750 through the TEE 790 and the processing units 752, 754.

The computing device 750 may be a server, an appliance computing device, and/or any other type of computing device with the various form factors discussed above. For example, the computing device 750 may be provided by an appliance computing device that is a self-contained electronic device including a housing, a chassis, a case, or a shell.

In an example, the instructions 782 provided via the memory 756, the storage 758, or the processing units 752, 754 may be embodied as a non-transitory, machine-readable medium 760 including code to direct the processor 752 to perform electronic operations in the edge computing node 750. The processing units 752, 754 may access the non-transitory, machine-readable medium 760 over the interconnect 755. For instance, the non-transitory, machine-readable medium 760 may be embodied by devices described for the storage 758 or may include specific storage units such as optical disks, flash drives, or any number of other hardware devices. The non-transitory, machine-readable medium 760 may include instructions to direct the processing units 752, 754 to perform a specific sequence or flow of actions, for example, as described with respect to the flowchart(s) and block diagram(s) of operations and functionality discussed herein. As used herein, the terms “machine-readable medium”, “machine-readable storage”, “computer-readable storage”, and “computer-readable medium” are interchangeable.

In further examples, a machine-readable medium also includes any tangible medium that is capable of storing, encoding, or carrying instructions for execution by a machine and that cause the machine to perform any one or more of the methodologies of the present disclosure, or that is capable of storing, encoding, or carrying data structures utilized by or associated with such instructions. A “machine-readable medium” thus may include, but is not limited to, solid-state memories, and optical and magnetic media. The instructions embodied by a machine-readable medium may further be transmitted or received over a communications network using a transmission medium via a network interface device utilizing any one of a number of transfer protocols (e.g., HTTP).

A machine-readable medium may be provided by a storage device or other apparatus which is capable of hosting data in a non-transitory format. In an example, information stored or otherwise provided on a machine-readable medium may be representative of instructions, such as instructions themselves or a format from which the instructions may be derived. This format from which the instructions may be derived may include source code, encoded instructions (e.g., in compressed or encrypted form), packaged instructions (e.g., split into multiple packages), or the like. The information representative of the instructions in the machine-readable medium may be processed by processing circuitry into the instructions to implement any of the operations discussed herein. For example, deriving the instructions from the information (e.g., processing by the processing circuitry) may include: compiling (e.g., from source code, object code, etc.), interpreting, loading, organizing (e.g., dynamically or statically linking), encoding, decoding, encrypting, unencrypting, packaging, unpackaging, or otherwise manipulating the information into the instructions.

In an example, the derivation of the instructions may include assembly, compilation, or interpretation of the information (e.g., by the processing circuitry) to create the instructions from some intermediate or preprocessed format provided by the machine-readable medium. The information, when provided in multiple parts, may be combined, unpacked, and modified to create the instructions. For example, the information may be in multiple compressed source code packages (or object code, or binary executable code, etc.) on one or several remote servers.

In further examples, a software distribution platform (e.g., one or more servers and one or more storage devices) may be used to distribute software, such as the example instructions discussed above, to one or more devices, such as example processor platform(s) and/or example connected edge devices noted above. The example software distribution platform may be implemented by any computer server, data facility, cloud service, etc., capable of storing and transmitting software to other computing devices. In some examples, the providing entity is a developer, a seller, and/or a licensor of software, and the receiving entity may be consumers, users, retailers, OEMs, etc., that purchase and/or license the software for use and/or re-sale and/or sub-licensing.

Turning now to FIGS. 8-20, these illustrate various mechanisms for accelerator pooling and exposing accelerators as a service. In a distributed IPU environment (e.g., an IPU mesh), one or more IPU elements may provide accelerated functions. The IPU-managed accelerators may be pooled into a resource pool that other compute platforms can tap into and borrow from. IPU and non-IPU elements may act as either producers or consumers in such a resource-sharing environment in order to provide or use accelerator services. In addition to contributing its own surplus accelerator resources into a pool from which others can borrow, a particular IPU may also borrow accelerator resources from the pool to which others (either local or remote compute platforms) contribute. Further, the IPU may provide management of such pooling, so that it acts flexibly as producer, consumer, aggregator, and broker of accelerator resources between providers and consumers.

As part of the role of orchestrator, an IPU may allow for smart chaining of tasks based on various attributes. Such attributes may identify the “locality” of the task. This locality may be used to estimate how much overhead may be incurred when transferring data and control from one task to another. Tasks on CPUs and XPUs that request accelerator operations at an IPU may also use a channel application programming interface (API) at the IPU to route data from one accelerator operation to another acceleration operation. Tasks may be implemented as microservices, serverless functions, deployable artifacts, or the like. Thus, the IPU removes the need to move data across links when the channel being established is local to accelerator slices at the same IPU, and it also removes the need to route data across sidecars or other intermediaries, because the chained flows can be set up from accelerator to accelerator, whether by local CXL or some other high-speed fabric, by memory pooling, or over network transport.
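A minimal sketch of such channel setup is shown below, assuming a hypothetical locality query and channel descriptor format; when the producer and consumer are local to the same IPU, the channel avoids link and sidecar traversal.

```python
# Assumed illustration of setting up a channel between two chained
# accelerator operations based on their locality.
def setup_channel(producer_task, consumer_task, locality_of):
    """Return a channel descriptor for routing the producer's output to the consumer."""
    if locality_of(producer_task) == locality_of(consumer_task):
        # Same IPU: hand data off accelerator-to-accelerator with no link or sidecar hops.
        return {"kind": "local", "transport": "accelerator-to-accelerator"}
    # Different locality: e.g., CXL or another high-speed fabric, memory pooling, or network transport.
    return {"kind": "remote", "transport": "cxl-or-network"}

channel = setup_channel("task_1", "task_2", locality_of=lambda task: "ipu_a")
print(channel["kind"])   # local
```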

When acting as an orchestrator or scheduler, an IPU may operate based on characteristics of service level policies (e.g., SLAs or SLOs). For instance, the data flows from task_1 to task_2 by the IPU selecting a scheduling policy that maximizes the effectiveness of local caching of the producer-to-consumer streams, instead of flows between tasks that require data to be spilled to and filled from memory or storage. The indirect reads and writes to memory or storage are relatively slow and increase latency along with increasing memory requirements.

The aggregation of accelerator functions may be both at a hardware level in provisioning and switching setup, and at a higher, software/driver level where the IPU acts as a capacity aggregator to dispatch accelerator invocations transparently between software and pooled versions of hardware.

The IPU may expose accelerator functions as a service (XFaaS).

In such an implementation, the IPU offloads from the CPU the responsibility of mapping an event to an accelerated function, so that the event is simply forwarded to the IPU. The IPU then handles the invocation of the corresponding accelerator function. This is possible since there is no long-duration state with a function (unlike a stateful service on a CPU, which in general may not be movable to an IPU).

The accelerator functions may be virtualized. In such an implementation, the user of an accelerator is able to invoke a standard interface and discover or direct the accelerator capability using a simple, parameterized call instead of having to invoke the accelerator function in a low-level manner. The IPU maps the parameters in the call to the type of acceleration that needs to be performed. This is synergistic with pooled use of acceleration resources and XFaaS because it hides the low-level minutiae of how the acceleration is being performed, where it is being performed, etc., from the invoker.
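The parameterized, virtualized invocation may be illustrated by the following sketch; the registry contents, resource handles, and call signature are assumptions used only to show how the IPU could resolve a simple call to a pooled accelerator on the caller's behalf.

```python
# Assumed mapping from a parameterized request to a pooled accelerator resource.
ACCELERATION_REGISTRY = {
    # (operation, qualifier) -> pooled accelerator resource handle (handles are illustrative)
    ("inference", "int8"): "pool://ipu-a/asic-0",
    ("inference", "fp16"): "pool://ipu-b/gpu-slice-3",
    ("compression", "lz4"): "pool://ipu-a/fpga-1",
}

def invoke_accelerated(operation, payload, **params):
    """Standard interface: the invoker never names a device, driver, or location."""
    key = (operation, params.get("precision", params.get("codec", "")))
    target = ACCELERATION_REGISTRY.get(key)
    if target is None:
        raise LookupError(f"no pooled accelerator for {key}")
    # A real implementation would dispatch the payload to `target`; here we only
    # return the mapping that the IPU resolved for the caller.
    return {"routed_to": target, "bytes": len(payload)}

print(invoke_accelerated("inference", b"image-bytes", precision="int8"))
```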

FIG. 8 is a block diagram illustrating the general flow for providing acceleration as a service, according to an example. To provide acceleration as a service, there are three main phases: capture the task dependencies in a graph (operation 802), map the graph to available acceleration functions (operation 804), and orchestrate the flow of data between CPU and non-CPU based functions (operation 806).

At 802, the logical organization of a solution is captured in the form of a graph of computations (tasks). These computations (tasks) may be serverless functions, stateful serverless functions, or microservices.

At 804, the graph of computations is mapped to available accelerator functions at some fine granularity of a cluster. For instance, a clique comprising between 1K and 10K cores may be used as an accelerator service where execution can be considered to be tightly coupled from a scheduling perspective.

At 806, the flow of data between the CPU-based and non-CPU-based functions or microservices is scheduled and optimized, with inter-operable networking capabilities from IPUs in the clique, so that clique computations flow seamlessly between CPU- and XPU-implemented microservices with very light mediation by CPU-based software.
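
The three phases of FIG. 8 can be summarized in a small end-to-end sketch: capture the dependency graph, map each task to a candidate implementation, and emit the producer-to-consumer routes that the IPUs must carry. The task names, candidate lists, and the "prefer a non-CPU implementation" rule are simplified assumptions, not the disclosed mapping algorithm.

```python
# Sketch of the three phases: capture (802), map (804), orchestrate (806).
task_graph = {            # operation 802: producer -> consumers
    "decode": ["detect"],
    "detect": ["annotate"],
    "annotate": [],
}

available_accelerators = {   # operation 804: task -> candidate implementations
    "decode": ["gpu"],
    "detect": ["fpga", "cpu"],
    "annotate": ["cpu"],
}

def map_tasks(graph, accelerators):
    # Prefer the first non-CPU implementation when one exists.
    return {t: next((a for a in accelerators[t] if a != "cpu"), "cpu") for t in graph}

def orchestrate(graph, placement):
    # operation 806: emit the producer-to-consumer hops the IPUs must carry.
    for producer, consumers in graph.items():
        for consumer in consumers:
            print(f"route {producer}@{placement[producer]} -> {consumer}@{placement[consumer]}")

placement = map_tasks(task_graph, available_accelerators)
orchestrate(task_graph, placement)
```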

FIG. 9 depicts the decomposition or refactoring of a compound service into its component microservices, functions, etc., shown as circles R, S, . . . , Z. Here, FIG. 9 is used as a common example through the remainder of this writeup. Unless it is necessary to be specific, the term "Task" will refer to these component functions, microservices, etc. Performing multiple tasks in a prescribed sequence may be used to fulfill or complete a job or workload instance.

FIG. 10 depicts a number of "events" or triggers (a, b, . . . , g) that either cause the various tasks R-Z to be triggered, resumed from a waiting state, or interrupted in order to respond to some event. Triggers or events that cause one task to be resumed or be otherwise affected in some way may be generated externally or may arise from the execution of one of the other tasks R-Z. Thus, trigger g causes tasks S and W to be notified and consequently activate or begin processing in some way, and trigger b causes task Y to be notified and consequently activate or begin processing.

FIG. 11 depicts various data that is produced, consumed, or produced and consumed by the different tasks R-Z in a job. Data produced by one or more of these tasks may be stored in a distributed datastore (e.g., a database, a datalake, a datastream, etc.). Similarly, data used by one of these tasks as input may be retrieved from the distributed datastore.

FIG. 12 depicts the execution of a task that may generate triggers that affect other tasks (shown by dashed arrows). Similarly, FIG. 12 shows the actual data dependencies (e.g., production/consumption relationships) between tasks with the solid thick arrows.

In FIG. 11, each task was sourcing or sinking the data that it respectively consumed or produced into a datalake, a datastream, etc., and FIG. 12 shows the actual logical producer-consumer relationships between tasks, in which the datalakes, datastreams, etc. of FIG. 11 are carriers of the data transiting between the producers and consumers. Not all data that is produced or consumed needs to come from the execution of another task. For example, as shown in FIG. 12, any data produced by the execution of tasks Y, V, Z, S, and U is not consumed by any of the other tasks, and tasks X, U, and S do not consume data produced by any of the other tasks. FIG. 12 thus graphically depicts the dataflow and execution dependencies of the tasks, which may be called a flowgraph. A flowgraph is a representation, using graph notation, of all paths that may be traversed through a program during its execution.
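
A flowgraph of this kind can be represented very simply as a set of producer-consumer edges, from which sources (tasks that consume only external data) and sinks (tasks whose output goes only to the datastore) can be derived. The edges below follow the R-Z naming of the example but are illustrative, not the exact edges of the figure.

```python
# Sketch: flowgraph as explicit producer-consumer edges.
flowgraph_edges = [
    ("W", "R"),   # W produces data consumed by R
    ("R", "T"),
    ("T", "Y"),
]

def producers_of(task, edges):
    return [p for p, c in edges if c == task]

def consumers_of(task, edges):
    return [c for p, c in edges if p == task]

tasks = {t for edge in flowgraph_edges for t in edge}
# Sinks write only to the datastore; sources consume only external data.
sinks = [t for t in tasks if not consumers_of(t, flowgraph_edges)]
sources = [t for t in tasks if not producers_of(t, flowgraph_edges)]
print("sink tasks:", sinks, "source tasks:", sources)
```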

FIG. 13 depicts a process 1300 of flow optimization, according to an example. A flowgraph is accessed (operation 1302). The flowgraph may be specified to IPUs through an interface that is supported by either hardware or software logic at one or more IPUs (operation 1304). Those tasks that are able to be accelerated by available acceleration resources, along with tasks that are able to be alternatively implemented in classic (traditional) CPU-based software logic, are then seamlessly initiated in a clique-wide distribution by the IPU-based acceleration-as-a-service logic (operation 1306). The operation of this logic also achieves flow optimization (operation 1308). Optimization may attempt to achieve various goals, such as minimizing the amount of data movement, minimizing latency, maximizing resource utilization, or maximizing the capacity of available acceleration resources. Other goals may be used in determining what is considered a local optimization. For instance, data flow optimization may mean either eliminating or reducing to a minimum the amount of data that is moved from producing tasks into a datalake/datastore/datastream of FIG. 13, when that data is just ephemeral and is consumed by one of the other tasks as shown by the dataflow relationships in FIG. 12, which are described to the service logic by the specifying of the flowgraph in FIG. 13.

Next, for convenience of description, a subset of the flowgraph illustrated in FIG. 12 is depicted in FIG. 14 to show some of the additional mechanisms and data structures. In particular, FIG. 14 shows that subset of FIG. 12 in which fewer tasks and edges are shown. For each task such as R, T, Y, and W, a chart shown in FIG. 15 contains a correspondence between the logical ID of the task and the corresponding logical IDs of available accelerator implementations for that task. Thus, for a logical task R, three possible implementations (artifacts) exist. The first is R0, which is a CPU-based software function (e.g., in source, binary, or intermediate form). The second is R1, which is in the form of a GPU-oriented software function (e.g., in one of source/binary/intermediate forms). The third is R2, which is in the form of an FPGA bitstream. The IDs R0, R1, and R2 may be URIs, URLs, UUIDs, etc. that help distinguish, identify, and locate the artifacts. The artifacts may be optionally replicated for easy distributed availability so that the service can launch a standard (unaccelerated) version of R as R0 on a CPU or an accelerated version of R as version R1 on a GPU. Alternatively, the task may be implemented as an FPGA version R2 on one or more FPGAs that the service can obtain from a distributed clique orchestration service such as K8s, OpenStack, etc. Generally, artifacts include files used to create and run an application. As such, an artifact may refer to compiled code, an executable file, a bytecode file, a configuration binary, a bitstream, or the like.
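
The task-to-artifact correspondence of FIG. 15 can be sketched as a small registry keyed by logical task ID. The identifiers and metadata fields below are illustrative assumptions; as noted above, real deployments might instead use URIs, URLs, or UUIDs.

```python
# Sketch: registry mapping logical task IDs to candidate artifacts.
artifact_registry = {
    "R": [
        {"id": "R0", "target": "cpu",  "form": "executable"},
        {"id": "R1", "target": "gpu",  "form": "kernel binary"},
        {"id": "R2", "target": "fpga", "form": "bitstream"},
    ],
    "Y": [
        {"id": "Y0", "target": "cpu",  "form": "executable"},
    ],
    "T": [
        {"id": "T0", "target": "cpu",  "form": "executable"},
        {"id": "T1", "target": "fpga", "form": "bitstream"},
        {"id": "T2", "target": "asic", "form": "configuration binary"},
    ],
}

def artifacts_for(task, target=None):
    """Look up artifacts for a logical task, optionally filtered by hardware target."""
    candidates = artifact_registry.get(task, [])
    return [a for a in candidates if target is None or a["target"] == target]

print(artifacts_for("T", target="fpga"))   # the FPGA bitstream T1
print(artifacts_for("Y"))                  # Y can only run as CPU software
```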

When referring to Java, an artifact may be identified by name, version, and scope. Scope may indicate whether it is a user artifact that exists in a namespace and cannot be accessed by other namespaces, or a system-scope artifact, which can be accessed by any namespace. In Kubernetes, artifacts represent resources that are updated as a part of a deployment pipeline.

FIG. 15 also shows, for example, that task Y can only be run as CPU-based software logic using a software artifact Y0. Conversely, task T may run in the form of a traditional CPU-based program using artifact T0, on an FPGA in the form of a bitstream artifact T1, or on a special-purpose ASIC in the form of an artifact T2. Artifacts may be in the form of a bitstream, bit file, programming file, executable file, binary file, or other configuration file used to configure an FPGA, CGRA, ASIC, or general CPU to execute an acceleration operation.

A registry of artifacts may be stored in a distributed database, a centralized database, or otherwise made available to one or more IPUs that are orchestrating and scheduling tasks. The registry may include identifiers of the artifacts, their location, and other metadata about the artifacts, such as reliability, capability, security features, geographical location, service costs, and the like. The registry may be stored in a datastore, datalake, database, or the like, such as illustrated and described in FIGS. 11, 17, and 18.

FIG. 16 depicts an undirected dataflow graph of the tasks R, T, Y, and W, according to an example. In particular, FIG. 16 illustrates a logical dataflow relationship in the form of an undirected edge adjacency list for each task. The undirected dataflow graph can be used to schedule and orchestrate when tasks should be set into execution, along with where and how ephemeral data should flow between the executing instances of those tasks, in order to optimize the latencies, schedules, and resources (e.g., memory, storage, network allocations, etc.) at the respective execution resources hosting the accelerated/unaccelerated versions of their artifacts. In general, for each logical task there may be multiple different logical artifacts, and each logical artifact may be capable of being instanced (i.e., set up as an executing instance) on hardware assets like CPUs, FPGAs, GPUs, ASICs, etc. on different hosts. This information is available in resource databases such as etcd (see https://etcd.io/) or others in an orchestration system, and it supports the flow optimization step referred to in operation 1308 of FIG. 13.

FIG. 17 depicts the transformation from an unoptimized dataflow graph 1700A to an optimized version 1700B of acceleration as a service as implemented by the agency of IPUs, for the subset graph shown in FIG. 14.

For the purposes of this discussion, each IPU illustrated in FIG. 17 is indicated with a numerical index identifier, such as "IPU4." The numerical index identifier corresponds to a host on which the IPU is colocated. So, in the case of IPU4, it is considered to be on "host 4."

FIG. 17 illustrates an unoptimized model 1700A where task R runs as R0 1702 software in host 1, with IPU1 1704. Note that from the chart in FIG. 15, it is understood that the artifact R0 1702 is a CPU software artifact and, as such, runs on one or more CPUs at host 1.

Further consulting the table in FIG. 15 in combination with the dataflow graph 1700A, task T runs as artifact T2 1712 on an ASIC available in host 4, with IPU4 1714; task Y runs as Y0 1722 software on CPUs in host 1, with IPU1 1704; and task W runs as FPGA-accelerated W1 1732 in host 7, with IPU7 1734.

The dataflow graph 1700A illustrates how data is received by an IPU (e.g., IPU1 1704), provided as input to a task (e.g., R0 1702), and how the resultant data is then returned to the IPU from the task, which may transmit the resultant data back to the datastore, database, datastream, or other storage. Consequently, in the unoptimized model 1700A, each execution consumes its data from the datastore/datastream 1740 (available as a distributed storage service) and produces its results and pushes them into the datastore/datastream for use by other tasks.

Among other services, one service that the respective IPUs in the different hosts perform is that of sourcing or sinking the data into that datastore/datastream, so that the execution logic of each instance is not burdened with this responsibility, and also so that some other CPU(s) in each of the hosts does not have to be interrupted in order to tend to these data movement operations.

A more optimized or streamlined flow 1700B, achieved by computing placement and routing, is illustrated in FIG. 17. In this optimized version 1700B, R0 1702 and Y0 1722 are moved to host 4 (they were executing on host 1) to be closer to T2 1712, which is also in host 4. Other flow actions are also undertaken so that: 1) the output of R0 1702 is conveyed to T2 1712 by IPU4 1714; 2) the output of T2 1712 is also conveyed to Y0 1722 by IPU4 1714, and directly to IPU7 1734 by IPU4 1714; 3) the output of Y0 1722 is conveyed to datastore/datastream 1740 directly by the host software on host 4 which executes Y0 1722, because there is no latency criticality and the CPUs already have the data available for streaming into the datastore; and 4) the output of W1 1732 (the FPGA-based implementation) in host 7 is fed into IPU4 1714 by IPU7 1734, for input to R0 1702.

Secondary types of flow optimizations are possible but are not shown in the above example for simplicity. They include, for example, choosing power-efficient alternative implementations, choosing latency-optimized implementations, or choosing based on performance/watt, performance/$/watt, and other such criteria (essentially mapping different resource-artifact combinations into the flow optimization formulation) based on any additional requirements specified in the task. These are considered as layering an optimization strategy that targets an evolving number of cost functions according to specified cost parameters between an application cohort and the IPU-XaaS service.
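
One way to read "layering cost functions" is as a weighted score over candidate resource-artifact combinations, as in the sketch below. The candidate figures, weights, and scoring rule are assumptions for illustration and not values or a formulation from this disclosure.

```python
# Sketch: layer secondary criteria (latency, power, perf/watt) over placement.
candidates = [
    {"artifact": "T1", "host": 4, "latency_ms": 2.0, "watts": 35.0, "throughput": 900},
    {"artifact": "T2", "host": 4, "latency_ms": 1.2, "watts": 60.0, "throughput": 1500},
    {"artifact": "T0", "host": 1, "latency_ms": 9.0, "watts": 15.0, "throughput": 200},
]

def cost(c, weights):
    # Lower is better: penalize latency and power draw, reward perf/watt.
    perf_per_watt = c["throughput"] / c["watts"]
    return (weights["latency"] * c["latency_ms"]
            + weights["power"] * c["watts"]
            - weights["efficiency"] * perf_per_watt)

weights = {"latency": 1.0, "power": 0.1, "efficiency": 0.5}
best = min(candidates, key=lambda c: cost(c, weights))
print("selected artifact:", best["artifact"], "on host", best["host"])
```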

FIG. 18 illustrates a database of information (also referred to as a registry) that stores each instance and type of logical artifact, such as T0, T1, and T2 for a given task T. These instances (i.e., execution-capable resources) support the execution of the task. The registry database contains various information about the instances, such as the throughput (capacity) of the implementation, the current running average of the utilization of those resources, and many other secondary flow optimization objectives referred to in the previous paragraph. This registry database is also referred to as the XDB (for accelerator database) and may be stored as a distributed, eventually consistent datastore available to the various orchestration modules that run on CPUs or IPUs in the clique for looking up and making decisions about flow optimization.

FIG. 19 depicts various functional components of an IPU 1900, according to an example. Although the functions are illustrated as being contained within a single IPU, in some examples, the functions are provided collectively across multiple IPUs in a clique of hosts through a common API that is implemented between the CPUs and the IPUs of the hosts. In an example, a CPU contains a software version of an IPU (essentially a virtualized IPU) so that in hosts that do not contain a discrete IPU, the IPU capabilities may be made available in a virtual form by the CPU. As such, various functionalities described with respect to IPU 1900 may be provided in software in a virtual IPU 1950 executed on a CPU 1952 on a host 1954.

The IPU 1900 includes a transfer logic 1902, a repossess logic 1904, an aggregate logic 1906, a disaggregate logic 1908, an allocate logic 1910, a deallocate logic 1912, an acquire logic 1914, a release logic 1916, a flow route selection logic 1918, and a software implementation offload logic 1920. Additional IPU functions may be provided by auxiliary logic 1922. The various logic components described in FIG. 19 read and store information and data in an accelerator database (XDB) 1940. A registry of artifacts may be stored in the XDB 1940. Further, the XDB 1940 may be used to store telemetry data, service level agreements (SLAs), service level objectives (SLOs), and other information used to optimize workflow, distribute tasks to resources, assign or instantiate artifacts at resources, lend and retrieve resources, aggregate resources, and handle offloading of CPU tasks to IPUs.

The transfer logic 1902 may also be referred to as a lending logic in that it provides the functionality to transfer or lend resources from the IPU to other CPUs or IPUs. The repossess logic 1904 may also be referred to as a "retake" logic or recover logic, in that it provides the functionality to repossess control of resources that were lent or transferred. Together, the transfer logic 1902 and repossess logic 1904 provide the borrowing or lending of resources available at a remote peer, in order to scale an acceleration-based service by creating multiple smaller-capacity instances between which the IPUs perform load balancing. Such borrowing or lending may be scheduled in advance and extended as needed. Additionally, the repossess logic 1904 may be configured to retake previously lent resources when they are either returned early, or when the time duration of lending closes and the borrower implicitly or explicitly vacates the use of the resource.

The aggregate logic 1906 is configured to relate resources together as an aggregated resource. In a corresponding manner, the disaggregate logic 1908 removes the relationships between individual resources. Together, the aggregate logic 1906 and disaggregate logic 1908 provide local aggregation of available resources for acceleration in order to scale up local capacity for a given task. Such aggregation may be homogeneous (e.g., aggregation is across an identical artifact ID and uses multiple tiles of the hardware resource for that artifact) or heterogeneous (e.g., different artifacts of the same logical task are combined across different hardware resources such as CPUs, GPUs, and FPGAs). The local aggregation/disaggregation also handles hot-plug capability so that acceleration capabilities may be dynamically furnished to a clique through hardware upgrades, or through CXL-based access to resource pools in which new acceleration capabilities are made available by bringing more racks, pods, etc. online.

The allocate logic 1910 is configured to allocate resources of an accelerator artifact to a task. In a corresponding manner, the deallocate logic 1912 removes associations between resources and tasks. In particular, together, the allocate logic 1910 and deallocate logic 1912 provide the allocation or binding of resources such as CPUs, GPUs, FPGAs, ASICs, etc., in order to produce a running (executing) version of an artifact on the allocated resources, and the releasing of the resources upon completion or suspension of the task.

The acquire logic 1914 is configured to find and assign resources. Correspondingly, the release logic 1916 is configured to release any assignment of resources. Together, these logic components are used to acquire and release locally-available resources for short-term purposes, such as when a task is just a serverless function and it is not necessary to allocate and deallocate resources for a long duration.
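
The paired operations described above (transfer/repossess, aggregate/disaggregate, allocate/deallocate, acquire/release) can be sketched as one pooling object with a method per operation. The class and method names below are illustrative assumptions and do not reflect an actual IPU interface.

```python
# Sketch: the pooling operations described for FIG. 19 in one simple object.
class AcceleratorPool:
    def __init__(self):
        self.lent = {}          # resource -> borrower (transfer / repossess)
        self.aggregates = {}    # group name -> member resources
        self.bindings = {}      # resource -> task (allocate / deallocate)

    def lend(self, resource, borrower):
        self.lent[resource] = borrower

    def repossess(self, resource):
        return self.lent.pop(resource, None)

    def aggregate(self, group, resources):
        self.aggregates[group] = list(resources)

    def disaggregate(self, group):
        return self.aggregates.pop(group, [])

    def allocate(self, resource, task):
        self.bindings[resource] = task          # long-lived binding

    def deallocate(self, resource):
        return self.bindings.pop(resource, None)

    def acquire(self, free_resources):
        # Short-term use, e.g., for a serverless function.
        return free_resources.pop() if free_resources else None

    def release(self, resource, free_resources):
        free_resources.append(resource)


pool = AcceleratorPool()
pool.lend("fpga0", borrower="IPU7")
pool.aggregate("gpu-pair", ["gpu0", "gpu1"])
pool.allocate("asic0", task="T2")
print(pool.lent, pool.aggregates, pool.bindings)
```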

The flow route selection logic 1918 is used to select a flowgraph and provision an optimized implementation. An example of the functionality of the flow route selection logic 1918 is provided in FIGS. 17 and 18 above. The flowgraph and implementation control flows may be stored in the XDB 1940.

The software implementation offload logic 1920 is used to offload a workload from a CPU-based artifact to an IPU-based artifact. This is sometimes desirable not just for managing the CPU's burden, but also to streamline communication and dataflow when a task is very light; having the IPU perform the CPU's work instead may save data transfers between the IPU and a CPU.

The IPU 1900 provides a multitude of technical advantages, including: 1) transparent merging of acceleration resources through an as-a-service consumption format; 2) optimized dataflows and low latencies; 3) low or no burden on CPUs for managing data flows between the different types of acceleration resources that are pressed into service; and 4) streamlined software, because application logic does not have to be concerned with physical movements of data.

Additionally, IPU and acceleration pooling may leverage 5G and other wireless architectures. For instance, pooling may be included as part of the user plane function (UPF) for performing traffic steering. This functionality can be leveraged to expose remote accelerators via the UPF.

The IPU and acceleration pooling can also include security aspects that can be used when performing the chain of tasks; this may include attestation, trust, etc. Such aspects may apply when there are pooled resources that are distributed from edge to cloud and the security and trust boundaries are different.

Finally, the IPU and acceleration pooling can also include a mapping into a K8s or cloud-native architecture. For instance, K8s plugins and operators can manage accelerators that are managed/exposed by the IPUs, enabling many more K8s construct types.

FIG. 20 is a flowchart illustrating a method for orchestrating acceleration functions in a network compute mesh, according to an example. A network compute mesh includes a plurality of compute nodes, where each node includes at least a central processing unit (CPU) or set of CPUs. Some compute nodes in the network compute mesh may include network-addressable processing units or networked processing units, such as IPUs or DPUs.

A network-addressable processing unit (also referred to as a networked processing unit) is a processing unit that has a unique network address and is able to process network traffic. A network-addressable processing unit may work in concert with other processing units on a compute node. For instance, a network-addressable processing unit may be integrated with a network interface card (NIC) and process network traffic for a general CPU. Although a network-addressable processing unit may provide network management facilities, a network-addressable processing unit may also be used to offload workloads from a CPU, expose accelerator functions to other network-addressable processing units, and orchestrate workflows between CPUs and network-addressable processing units on various compute nodes in the network compute mesh. In some implementations, a network-addressable processing unit may have a distinct, separate network address from the host that the network-addressable processing unit is installed within, so that the network-addressable processing unit is separately addressable from the host and can process network traffic that is not for the host.

Compute nodes in a network compute mesh may be organized into cliques. A clique is a group of two or more compute nodes where each of the compute nodes in the clique is adjacent to every other node in the clique. This tight communication coupling allows for lightweight workflow administration.
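
The clique property (every node adjacent to every other node in the group) can be checked directly from an adjacency map, as in the brief sketch below; the host names and adjacency data are illustrative only.

```python
# Sketch: verify that a set of compute nodes forms a clique.
adjacency = {
    "host1": {"host4", "host7"},
    "host4": {"host1", "host7"},
    "host7": {"host1", "host4"},
}

def is_clique(nodes, adjacency):
    # Every node must be adjacent to all other nodes in the group.
    return all(adjacency[a] >= (set(nodes) - {a}) for a in nodes)

print(is_clique(["host1", "host4", "host7"], adjacency))  # True
```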

At 2002, the method 2000 includes accessing a flowgraph, the flowgraph including data producer-consumer relationships between a plurality of tasks in a workload.

At 2004, the method 2000 includes identifying available artifacts and resources to execute the artifacts to complete each of the plurality of tasks, where an artifact is an instance of a function to perform a task of the plurality of tasks.

In an embodiment, the artifact comprises a bitstream to program a field-programmable gate array (FPGA). In an embodiment, the artifact comprises an executable file to execute on a central processing unit (CPU). In an embodiment, the artifact comprises a binary file to configure a coarse-grained reconfigurable array (CGRA).

In an embodiment, the resources comprise a central processing unit. In an embodiment, the resources comprise a network-accessible processing unit. In an embodiment, the resources comprise a graphics processing unit. In an embodiment, the resources comprise an application specific integrated circuit (ASIC). In an embodiment, the resources comprise a field-programmable gate array (FPGA). In an embodiment, the resources comprise a coarse-grained reconfigurable array (CGRA).

At 2006, the method 2000 includes determining a configuration assigning artifacts and resources to each of the plurality of tasks in the flowgraph.

In an embodiment, determining the configuration includes analyzing a service level objective (SLO) and assigning artifacts and resources to each of the plurality of tasks to satisfy the SLO.

In an embodiment, determining the configuration includes performing one or more of: minimizing an amount of data movement between the plurality of tasks and a storage device, minimizing latency of workload execution, maximizing resource utilization for the workload, maximizing capacity of acceleration available resources, minimizing the power consumption of the workload, or optimizing the perf/watt or perf/$/watt metric for the workload.
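
A very small sketch of what determining such a configuration might look like follows: candidates are filtered against an SLO and then preferred when colocated with an already-placed neighbor, to reduce data movement. The tasks, SLO figure, and the greedy rule are assumptions for illustration, not the method's prescribed algorithm.

```python
# Sketch: pick an artifact/resource per task against an SLO, preferring
# colocation with producer/consumer neighbors to reduce data movement.
slo = {"max_latency_ms": 10.0}

options = {
    "R": [{"artifact": "R0", "host": 1, "latency_ms": 6.0},
          {"artifact": "R1", "host": 4, "latency_ms": 3.0}],
    "T": [{"artifact": "T2", "host": 4, "latency_ms": 1.5}],
}

edges = [("R", "T")]   # R produces data that T consumes

def determine_configuration(options, edges, slo):
    config = {}
    for task, candidates in options.items():
        feasible = [c for c in candidates if c["latency_ms"] <= slo["max_latency_ms"]]
        # Hosts of neighbors that have already been placed.
        neighbors = {config[p]["host"] for p, c in edges if c == task and p in config}
        neighbors |= {config[c]["host"] for p, c in edges if p == task and c in config}
        colocated = [c for c in feasible if c["host"] in neighbors]
        config[task] = (colocated or feasible)[0]
    return config

print(determine_configuration(options, edges, slo))
```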

At 2008, the method 2000 includes scheduling, based on the configuration, the plurality of tasks to execute using the assigned artifacts and resources. In an embodiment, scheduling the plurality of tasks includes communicating from a first network-accessible processing unit to a second network-accessible processing unit via an application programming interface (API), to schedule a task of the plurality of tasks to execute using an artifact executing on a resource managed by the second network-accessible processing unit. In a further embodiment, scheduling the plurality of tasks includes lending or transferring resources from the first network-accessible processing unit to the second network-accessible processing unit for use when executing the artifact.
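
The IPU-to-IPU scheduling call described above can be pictured as a small request sent from the scheduling unit to the peer that manages the target resource, optionally including lent resources. The request shape, field names, and endpoint naming below are assumptions, not a defined IPU protocol.

```python
# Sketch: schedule a task on a peer IPU via an API, optionally lending resources.
def schedule_on_peer(local_ipu, peer_ipu, task, artifact, lend_resources=None):
    request = {
        "op": "schedule_task",
        "task": task,
        "artifact": artifact,
        "lent_resources": lend_resources or [],
        "reply_to": local_ipu,
    }
    # In a real mesh this would be an RPC to the peer IPU's API endpoint;
    # here we simply return the message that would be sent.
    return peer_ipu, request

peer, msg = schedule_on_peer("IPU1", "IPU4", task="T", artifact="T2",
                             lend_resources=["mem_pool_slice_3"])
print(f"send to {peer}: {msg}")
```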

In an embodiment, a task of the plurality of tasks produces a data result, which is stored in a distributed database accessible by at least one other task of the plurality of tasks.

Although these implementations have been described concerning specificexemplary aspects, it will be evident that various modifications andchanges may be made to these aspects without departing from the broaderscope of the present disclosure. Many of the arrangements and processesdescribed herein can be used in combination or in parallelimplementations that involve terrestrial network connectivity (whereavailable) to increase network bandwidth/throughput and to supportadditional edge services. Accordingly, the specification and drawingsare to be regarded in an illustrative rather than a restrictive sense.The accompanying drawings that form a part hereof show, by way ofillustration, and not of limitation, specific aspects in which thesubject matter may be practiced. The aspects illustrated are describedin sufficient detail to enable those skilled in the art to practice theteachings disclosed herein. Other aspects may be utilized and derivedtherefrom, such that structural and logical substitutions and changesmay be made without departing from the scope of this disclosure. ThisDetailed Description, therefore, is not to be taken in a limiting sense,and the scope of various aspects is defined only by the appended claims,along with the full range of equivalents to which such claims areentitled.

Such aspects of the inventive subject matter may be referred to herein,individually and/or collectively, merely for convenience and withoutintending to voluntarily limit the scope of this application to anysingle aspect or inventive concept if more than one is disclosed. Thus,although specific aspects have been illustrated and described herein, itshould be appreciated that any arrangement calculated to achieve thesame purpose may be substituted for the specific aspects shown. Thisdisclosure is intended to cover any adaptations or variations of variousaspects. Combinations of the above aspects and other aspects notspecifically described herein will be apparent to those of skill in theart upon reviewing the above description.

Embodiments may be implemented in one or a combination of hardware,firmware, and software. Embodiments may also be implemented asinstructions stored on a machine-readable storage device, which may beread and executed by at least one processor to perform the operationsdescribed herein. A machine-readable storage device may include anynon-transitory mechanism for storing information in a form readable by amachine (e.g., a computer). For example, a machine-readable storagedevice may include read-only memory (ROM), random-access memory (RAM),magnetic disk storage media, optical storage media, flash-memorydevices, and other storage devices and media.

Examples, as described herein, may include, or may operate on, logic or a number of components, such as modules, intellectual property (IP) blocks or cores, or mechanisms. Such logic or components may be hardware, software, or firmware communicatively coupled to one or more processors in order to carry out the operations described herein. Logic or components may be hardware modules (e.g., IP blocks), and as such may be considered tangible entities capable of performing specified operations and may be configured or arranged in a certain manner. In an example, circuits may be arranged (e.g., internally or with respect to external entities such as other circuits) in a specified manner as an IP block, IP core, system-on-chip (SoC), or the like.

In an example, the whole or part of one or more computer systems (e.g.,a standalone, client or server computer system) or one or more hardwareprocessors may be configured by firmware or software (e.g.,instructions, an application portion, or an application) as a modulethat operates to perform specified operations. In an example, thesoftware may reside on a machine-readable medium. In an example, thesoftware, when executed by the underlying hardware of the module, causesthe hardware to perform the specified operations. Accordingly, the termhardware module is understood to encompass a tangible entity, be that anentity that is physically constructed, specifically configured (e.g.,hardwired), or temporarily (e.g., transitorily) configured (e.g.,programmed) to operate in a specified manner or to perform part or allof any operation described herein.

Considering examples in which modules are temporarily configured, eachof the modules need not be instantiated at any one moment in time. Forexample, where the modules comprise a general-purpose hardware processorconfigured using software; the general-purpose hardware processor may beconfigured as respective different modules at different times. Softwaremay accordingly configure a hardware processor, for example, toconstitute a particular module at one instance of time and to constitutea different module at a different instance of time. Modules may also besoftware or firmware modules, which operate to perform the methodologiesdescribed herein.

An IP block (also referred to as an IP core) is a reusable unit oflogic, cell, or integrated circuit. An IP block may be used as a part ofa field programmable gate array (FPGA), application-specific integratedcircuit (ASIC), programmable logic device (PLD), system on a chip (SoC),or the like. It may be configured for a particular purpose, such asdigital signal processing or image processing. Example IP cores includecentral processing unit (CPU) cores, integrated graphics, security,input/output (I/O) control, system agent, graphics processing unit(GPU), artificial intelligence, neural processors, image processingunit, communication interfaces, memory controller, peripheral devicecontrol, platform controller hub, or the like.

In some examples, the instructions are stored on storage devices of thesoftware distribution platform in a particular format. A format ofcomputer readable instructions includes, but is not limited to aparticular code language (e.g., Java, JavaScript, Python, C, C#, SQL,HTML, etc.), and/or a particular code state (e.g., uncompiled code(e.g., ASCII), interpreted code, linked code, executable code (e.g., abinary), etc.). In some examples, the computer readable instructionsstored in the software distribution platform are in a first format whentransmitted to an example processor platform(s). In some examples, thefirst format is an executable binary in which particular types of theprocessor platform(s) can execute. However, in some examples, the firstformat is uncompiled code that requires one or more preparation tasks totransform the first format to a second format to enable execution on theexample processor platform(s). For instance, the receiving processorplatform(s) may need to compile the computer readable instructions inthe first format to generate executable code in a second format that iscapable of being executed on the processor platform(s). In still otherexamples, the first format is interpreted code that, upon reaching theprocessor platform(s), is interpreted by an interpreter to facilitateexecution of instructions.

Use Cases and Additional Examples

An IPU can be hosted in any of the tiers that go from device to cloud. Any compute platform that needs connectivity can potentially include an IPU. Some examples of places where IPUs can be placed are: vehicles; far edge; data center edge; cloud; smart cameras; smart devices.

Some of the use cases for a distributed IPU may include the following.

1) Service orchestrator (local, shared, remote, or distributed): power, workload performance, and ambient temperature prediction and optimization tuning, as well as service orchestration not just locally but across a distributed edge cloud.

2) Infrastructure offload (for the local machine): the same as traditional IPU use cases to offload network, storage, host virtualization, etc., but with additional edge-specific usages for network security, storage, and virtualization.

3) IPU as a host to augment compute capacity (using ARM/x86 cores) for running edge-specific "functions" on demand, integrated as an API/service or running as a K8s worker node for certain types of services: sidecar proxies, security attestation services, scrubbing traffic for SASE/L7 inspection, firewall, load balancer/forward or reverse proxy, service mesh sidecars (for each pod running on the local host), 5G UPF and other RAN offloads, etc.

Additional examples of the presently described method, system, and device embodiments include the following, non-limiting implementations. Each of the following non-limiting examples may stand on its own or may be combined in any permutation or combination with any one or more of the other examples provided below or throughout the present disclosure.

Example 1 is a system for orchestrating acceleration functions in anetwork compute mesh, comprising: a memory device configured to storeinstructions; and a processor subsystem, which when configured by theinstructions, is operable to: access a flowgraph, the flowgraphincluding data producer-consumer relationships between a plurality oftasks in a workload; identify available artifacts and resources toexecute the artifacts to complete each of the plurality of tasks,wherein an artifact is an instance of a function to perform a task ofthe plurality of tasks; determine a configuration assigning artifactsand resources to each of the plurality of tasks in the flowgraph; andschedule, based on the configuration, the plurality of tasks to executeusing the assigned artifacts and resources.

In Example 2, the subject matter of Example 1 includes, wherein theartifact comprises a bitstream to program a field-programmable gatearray (FPGA).

In Example 3, the subject matter of Examples 1-2 includes, wherein theartifact comprises an executable file to execute on a central processingunit (CPU).

In Example 4, the subject matter of Examples 1-3 includes, wherein theartifact comprises a binary file to configure a coarse-grainedreconfigurable array (CGRA).

In Example 5, the subject matter of Examples 1-4 includes, wherein the resources comprise a central processing unit.

In Example 6, the subject matter of Examples 1-5 includes, wherein theresources comprise a network-accessible processing unit.

In Example 7, the subject matter of Examples 1-6 includes, wherein theresources comprise a graphics processing unit.

In Example 8, the subject matter of Examples 1-7 includes, wherein theresources comprise an application specific integrated circuit (ASIC).

In Example 9, the subject matter of Examples 1-8 includes, wherein theresources comprise a field-programmable gate array (FPGA).

In Example 10, the subject matter of Examples 1-9 includes, wherein theresources comprise a coarse-grained reconfigurable array (CGRA).

In Example 11, the subject matter of Examples 1-10 includes, whereindetermining the configuration comprises: analyzing a service levelobjective (SLO); and assigning artifacts and resources to each of theplurality of tasks to satisfy the SLO.

In Example 12, the subject matter of Examples 1-11 includes, wherein determining the configuration comprises performing one or more of: minimizing an amount of data movement between the plurality of tasks and a storage device, minimizing latency of workload execution, maximizing resource utilization for the workload, maximizing capacity of acceleration available resources, minimizing the power consumption of the workload, or optimizing the perf/watt or perf/$/watt metric for the workload.

In Example 13, the subject matter of Examples 1-12 includes, whereinscheduling the plurality of tasks comprises: communicating from a firstnetwork-accessible processing unit to a second network-accessibleprocessing unit via an application programming interface (API), toschedule a task of the plurality of tasks to execute using an artifactexecuting on a resource managed by the second network-accessibleprocessing unit.

In Example 14, the subject matter of Example 13 includes, wherein scheduling the plurality of tasks comprises: lending or transferring resources from the first network-accessible processing unit to the second network-accessible processing unit for use when executing the artifact.

In Example 15, the subject matter of Examples 1-14 includes, wherein a task of the plurality of tasks produces a data result, which is stored in a distributed database accessible by at least one other task of the plurality of tasks.

Example 16 is a method for orchestrating acceleration functions in anetwork compute mesh, comprising: accessing a flowgraph, the flowgraphincluding data producer-consumer relationships between a plurality oftasks in a workload; identifying available artifacts and resources toexecute the artifacts to complete each of the plurality of tasks,wherein an artifact is an instance of a function to perform a task ofthe plurality of tasks; determining a configuration assigning artifactsand resources to each of the plurality of tasks in the flowgraph; andscheduling, based on the configuration, the plurality of tasks toexecute using the assigned artifacts and resources.

In Example 17, the subject matter of Example 16 includes, wherein theartifact comprises a bitstream to program a field-programmable gatearray (FPGA).

In Example 18, the subject matter of Examples 16-17 includes, whereinthe artifact comprises an executable file to execute on a centralprocessing unit (CPU).

In Example 19, the subject matter of Examples 16-18 includes, whereinthe artifact comprises a binary file to configure a coarse-grainedreconfigurable array (CGRA).

In Example 20, the subject matter of Examples 16-19 includes, whereinthe resources comprise a central processing unit.

In Example 21, the subject matter of Examples 16-20 includes, whereinthe resources comprise a network-accessible processing unit.

In Example 22, the subject matter of Examples 16-21 includes, whereinthe resources comprise a graphics processing unit.

In Example 23, the subject matter of Examples 16-22 includes, whereinthe resources comprise an application specific integrated circuit(ASIC).

In Example 24, the subject matter of Examples 16-23 includes, whereinthe resources comprise a field-programmable gate array (FPGA).

In Example 25, the subject matter of Examples 16-24 includes, whereinthe resources comprise a coarse-grained reconfigurable array (CGRA).

In Example 26, the subject matter of Examples 16-25 includes, whereindetermining the configuration comprises: analyzing a service levelobjective (SLO); and assigning artifacts and resources to each of theplurality of tasks to satisfy the SLO.

In Example 27, the subject matter of Examples 16-26 includes, whereindetermining the configuration comprises performing one or more of:minimizing an amount of data movement between the plurality of tasks anda storage device, minimizing latency of workload execution, maximizingresource utilization for the workload, maximizing capacity ofacceleration available resources, minimizing the power consumption ofthe workload, or optimizing the perf/watt or perf/$/watt metric for theworkload.

In Example 28, the subject matter of Examples 16-27 includes, whereinscheduling the plurality of tasks comprises: communicating from a firstnetwork-accessible processing unit to a second network-accessibleprocessing unit via an application programming interface (API), toschedule a task of the plurality of tasks to execute using an artifactexecuting on a resource managed by the second network-accessibleprocessing unit.

In Example 29, the subject matter of Example 28 includes, wherein scheduling the plurality of tasks comprises: lending or transferring resources from the first network-accessible processing unit to the second network-accessible processing unit for use when executing the artifact.

In Example 30, the subject matter of Examples 16-29 includes, wherein a task of the plurality of tasks produces a data result, which is stored in a distributed database accessible by at least one other task of the plurality of tasks.

Example 31 is at least one machine-readable medium includinginstructions for orchestrating acceleration functions in a networkcompute mesh, which when executed by a machine, cause the machine to:access a flowgraph, the flowgraph including data producer-consumerrelationships between a plurality of tasks in a workload; identifyavailable artifacts and resources to execute the artifacts to completeeach of the plurality of tasks, wherein an artifact is an instance of afunction to perform a task of the plurality of tasks; determine aconfiguration assigning artifacts and resources to each of the pluralityof tasks in the flowgraph; and schedule, based on the configuration, theplurality of tasks to execute using the assigned artifacts andresources.

In Example 32, the subject matter of Example 31 includes, wherein theartifact comprises a bitstream to program a field-programmable gatearray (FPGA).

In Example 33, the subject matter of Examples 31-32 includes, whereinthe artifact comprises an executable file to execute on a centralprocessing unit (CPU).

In Example 34, the subject matter of Examples 31-33 includes, whereinthe artifact comprises a binary file to configure a coarse-grainedreconfigurable array (CGRA).

In Example 35, the subject matter of Examples 31-34 includes, whereinthe resources comprise a central processing unit.

In Example 36, the subject matter of Examples 31-35 includes, whereinthe resources comprise a network-accessible processing unit.

In Example 37, the subject matter of Examples 31-36 includes, whereinthe resources comprise a graphics processing unit.

In Example 38, the subject matter of Examples 31-37 includes, whereinthe resources comprise an application specific integrated circuit(ASIC).

In Example 39, the subject matter of Examples 31-38 includes, whereinthe resources comprise a field-programmable gate array (FPGA).

In Example 40, the subject matter of Examples 31-39 includes, whereinthe resources comprise a coarse-grained reconfigurable array (CGRA).

In Example 41, the subject matter of Examples 31-40 includes, wherein the instructions to determine the configuration comprise instructions to: analyze a service level objective (SLO); and assign artifacts and resources to each of the plurality of tasks to satisfy the SLO.

In Example 42, the subject matter of Examples 31-41 includes, wherein the instructions to determine the configuration comprise instructions to perform one or more of: minimizing an amount of data movement between the plurality of tasks and a storage device, minimizing latency of workload execution, maximizing resource utilization for the workload, maximizing capacity of acceleration available resources, minimizing the power consumption of the workload, or optimizing the perf/watt or perf/$/watt metric for the workload.

In Example 43, the subject matter of Examples 31-42 includes, whereinthe instructions to schedule the plurality of tasks compriseinstructions to: communicate from a first network-accessible processingunit to a second network-accessible processing unit via an applicationprogramming interface (API), to schedule a task of the plurality oftasks to execute using an artifact executing on a resource managed bythe second network-accessible processing unit.

In Example 44, the subject matter of Example 43 includes, wherein scheduling the plurality of tasks comprises: lending or transferring resources from the first network-accessible processing unit to the second network-accessible processing unit for use when executing the artifact.

In Example 45, the subject matter of Examples 31-44 includes, wherein a task of the plurality of tasks produces a data result, which is stored in a distributed database accessible by at least one other task of the plurality of tasks.

Example 46 is at least one machine-readable medium includinginstructions that, when executed by processing circuitry, cause theprocessing circuitry to perform operations to implement of any ofExamples 1-45.

Example 47 is an apparatus comprising means to implement of any ofExamples 1-45.

Example 48 is a system to implement of any of Examples 1-45.

Example 49 is a method to implement of any of Examples 1-45.

The above detailed description includes references to the accompanyingdrawings, which form a part of the detailed description. The drawingsshow, by way of illustration, specific embodiments that may bepracticed. These embodiments are also referred to herein as “examples.”Such examples may include elements in addition to those shown ordescribed. However, also contemplated are examples that include theelements shown or described. Moreover, also contemplated are examplesusing any combination or permutation of those elements shown ordescribed (or one or more aspects thereof), either with respect to aparticular example (or one or more aspects thereof), or with respect toother examples (or one or more aspects thereof) shown or describedherein.

Publications, patents, and patent documents referred to in this document are incorporated by reference herein in their entirety, as though individually incorporated by reference. In the event of inconsistent usages between this document and those documents so incorporated by reference, the usage in the incorporated reference(s) is supplementary to that of this document; for irreconcilable inconsistencies, the usage in this document controls.

In this document, the terms “a” or “an” are used, as is common in patentdocuments, to include one or more than one, independent of any otherinstances or usages of “at least one” or “one or more.” In thisdocument, the term “or” is used to refer to a nonexclusive or, such that“A or B” includes “A but not B,” “B but not A,” and “A and B,” unlessotherwise indicated. In the appended claims, the terms “including” and“in which” are used as the plain-English equivalents of the respectiveterms “comprising” and “wherein.” Also, in the following claims, theterms “including” and “comprising” are open-ended, that is, a system,device, article, or process that includes elements in addition to thoselisted after such a term in a claim are still deemed to fall within thescope of that claim. Moreover, in the following claims, the terms“first,” “second,” and “third,” etc. are used merely as labels, and arenot intended to suggest a numerical order for their objects.

The above description is intended to be illustrative, and notrestrictive. For example, the above-described examples (or one or moreaspects thereof) may be used in combination with others. Otherembodiments may be used, such as by one of ordinary skill in the artupon reviewing the above description. The Abstract is to allow thereader to quickly ascertain the nature of the technical disclosure. Itis submitted with the understanding that it will not be used tointerpret or limit the scope or meaning of the claims. Also, in theabove Detailed Description, various features may be grouped together tostreamline the disclosure. However, the claims may not set forth everyfeature disclosed herein as embodiments may feature a subset of saidfeatures. Further, embodiments may include fewer features than thosedisclosed in a particular example. Thus, the following claims are herebyincorporated into the Detailed Description, with a claim standing on itsown as a separate embodiment. The scope of the embodiments disclosedherein is to be determined with reference to the appended claims, alongwith the full scope of equivalents to which such claims are entitled.

What is claimed is:
 1. A system for orchestrating acceleration functionsin a network compute mesh, comprising: a memory device configured tostore instructions; and a processor subsystem, which when configured bythe instructions, is operable to: access a flowgraph, the flowgraphincluding data producer-consumer relationships between a plurality oftasks in a workload; identify available artifacts and resources toexecute the artifacts to complete each of the plurality of tasks,wherein an artifact is an instance of a function to perform a task ofthe plurality of tasks; determine a configuration assigning artifactsand resources to each of the plurality of tasks in the flowgraph; andschedule, based on the configuration, the plurality of tasks to executeusing the assigned artifacts and resources.
 2. The system of claim 1,wherein the artifact comprises a bitstream to program afield-programmable gate array (FPGA).
 3. The system of claim 1, whereinthe artifact comprises an executable file to execute on a centralprocessing unit (CPU).
 4. The system of claim 1, wherein the artifactcomprises a binary file to configure a coarse-grained reconfigurablearray (CGRA).
 5. The system of claim 1, wherein the resources comprise acentral processing unit.
 6. The system of claim 1, wherein the resourcescomprise a network-accessible processing unit.
 7. The system of claim 1,wherein the resources comprise a graphics processing unit.
 8. The systemof claim 1, wherein the resources comprise an application specificintegrated circuit (ASIC).
 9. The system of claim 1, wherein theresources comprise a field-programmable gate array (FPGA).
 10. Thesystem of claim 1, wherein the resources comprise a coarse-grainedreconfigurable array (CGRA).
 11. The system of claim 1, whereindetermining the configuration comprises: analyzing a service levelobjective (SLO); and assigning artifacts and resources to each of theplurality of tasks to satisfy the SLO.
 12. The system of claim 1,wherein determining the configuration comprises performing one or moreof minimizing an amount of data movement between the plurality of tasksand a storage device, minimizing latency of workload execution,maximizing resource utilization for the workload, maximizing capacity ofacceleration available resources, minimizing the power consumption ofthe workload, or optimizing the perf/watt or perf/$/watt metric for theworkload.
 13. The system of claim 1, wherein scheduling the plurality oftasks comprises: communicating from a first network-accessibleprocessing unit to a second network-accessible processing unit via anapplication programming interface (API), to schedule a task of theplurality of tasks to execute using an artifact executing on a resourcemanaged by the second network-accessible processing unit.
 14. The system of claim 13, wherein scheduling the plurality of tasks comprises: lending or transferring resources from the first network-accessible processing unit to the second network-accessible processing unit for use when executing the artifact.
 15. The system of claim 1, wherein a task of the plurality of tasks produces a data result, which is stored in a distributed database accessible by at least one other task of the plurality of tasks.
 16. A method for orchestrating accelerationfunctions in a network compute mesh, comprising: accessing a flowgraph,the flowgraph including data producer-consumer relationships between aplurality of tasks in a workload; identifying available artifacts andresources to execute the artifacts to complete each of the plurality oftasks, wherein an artifact is an instance of a function to perform atask of the plurality of tasks; determining a configuration assigningartifacts and resources to each of the plurality of tasks in theflowgraph; and scheduling, based on the configuration, the plurality oftasks to execute using the assigned artifacts and resources.
 17. Themethod of claim 16, wherein determining the configuration comprises:analyzing a service level objective (SLO); and assigning artifacts andresources to each of the plurality of tasks to satisfy the SLO.
 18. Themethod of claim 16, wherein determining the configuration comprisesperforming one or more of minimizing an amount of data movement betweenthe plurality of tasks and a storage device, minimizing latency ofworkload execution, maximizing resource utilization for the workload,maximizing capacity of acceleration available resources, minimizing thepower consumption of the workload, or optimizing the perf/watt orperf/$/watt metric for the workload.
 19. The method of claim 16, whereinscheduling the plurality of tasks comprises: communicating from a firstnetwork-accessible processing unit to a second network-accessibleprocessing unit via an application programming interface (API), toschedule a task of the plurality of tasks to execute using an artifactexecuting on a resource managed by the second network-accessibleprocessing unit.
 20. The method of claim 19, wherein scheduling the plurality of tasks comprises: lending or transferring resources from the first network-accessible processing unit to the second network-accessible processing unit for use when executing the artifact.
 21. The method of claim 16, wherein a task of the plurality of tasks produces a data result, which is stored in a distributed database accessible by at least one other task of the plurality of tasks.
 22. Atleast one machine-readable medium including instructions fororchestrating acceleration functions in a network compute mesh, whichwhen executed by a machine, cause the machine to: access a flowgraph,the flowgraph including data producer-consumer relationships between aplurality of tasks in a workload; identify available artifacts andresources to execute the artifacts to complete each of the plurality oftasks, wherein an artifact is an instance of a function to perform atask of the plurality of tasks; determine a configuration assigningartifacts and resources to each of the plurality of tasks in theflowgraph; and schedule, based on the configuration, the plurality oftasks to execute using the assigned artifacts and resources.
 23. The atleast one machine-readable medium of claim 22, wherein the instructionsto determine the configuration comprise the instructions to: analyze aservice level objective (SLO); and assign artifacts and resources toeach of the plurality of tasks to satisfy the SLO.
 24. The at least one machine-readable medium of claim 22, wherein the instructions to determine the configuration comprise instructions to perform one or more of: minimizing an amount of data movement between the plurality of tasks and a storage device, minimizing latency of workload execution, maximizing resource utilization for the workload, maximizing capacity of acceleration available resources, minimizing the power consumption of the workload, or optimizing the perf/watt or perf/$/watt metric for the workload.
 25. The at least one machine-readable medium of claim 22,wherein the instructions to schedule the plurality of tasks compriseinstructions to: communicate from a first network-accessible processingunit to a second network-accessible processing unit via an applicationprogramming interface (API), to schedule a task of the plurality oftasks to execute using an artifact executing on a resource managed bythe second network-accessible processing unit.